STAT 333 Course Note

Table of Contents

1. Fundamentals of Probability

1.1. What's Probability

1.1.1. Examples

  1. Coin toss
  2. Roll a die
  3. Tomorrow's weather
  4. Randomly pick a number in [0, 1]

Although outcomes are random, they are not haphazard or arbitrary: there are "patterns".

Example 1

If we repeat tossing a coin, then the fraction of times that we get an "H" goes to \frac{1}{2} as the number of tosses goes to infinity.

\frac{\#\ \text{of "H"}}{\text{total}\ \#\ \text{of tosses}} = \frac{1}{2}

This number \frac{1}{2} reflects how "likely" an "H" is to appear in one toss (even if the experiment is not repeated)
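The long-run frequency idea is easy to check numerically. Below is a minimal simulation sketch (not from the notes; the helper name `head_fraction` is made up for illustration):

```python
import random

# Sketch: estimate the long-run fraction of heads over many fair-coin tosses.
random.seed(0)

def head_fraction(n_tosses):
    """Fraction of heads in n_tosses tosses of a fair coin."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The fraction approaches 1/2 as the number of tosses grows.
print(head_fraction(100_000))
```

With 100,000 tosses the estimate is typically within about 0.01 of \frac{1}{2}.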

1.2. Probability Models

The sample space \Omega is the set consisting of all the possible outcomes of a random experiment.

1.2.1. Examples

  1. \{H, T\}
  2. \{1,2,3,4,5,6\}
  3. \{\text{sunny}, \text{rainy}, \text{cloudy}, \ldots\}
  4. [0, 1]

An event E \subseteq \Omega is a subset of \Omega

for which we can talk about the "likelihood of happening".

We say an event E "happens" if the result of the experiment turns out to belong to E (a subset of \Omega)

A probability P is a set function (a mapping from events to real numbers)

\begin{aligned} P: & \;\xi \rightarrow R \\ & \;E \mapsto P(E) \end{aligned}

which satisfies the following 3 properties:

  1. \forall E \in \xi, 0 \leq P(E) \leq 1
  2. P(\Omega) = 1
  3. Countable additivity: for disjoint events E_1, E_2, \ldots, \; P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i)

Intuitively, one can think of the probability of an event as the "likelihood/chance" that the event happens. If we repeat the experiment a large number of times, the probability is the fraction of times that the event happens

P(E) = \lim_{n\rightarrow\infty} \frac{\#\ \text{of times } E \text{ happens in } n \text{ trials}}{n}

1.2.1.1. Example 2

\begin{aligned} &P(\{1\})=P(\{2\})=\ldots=P(\{6\})=\frac{1}{6} \\ &E = \{\text{even number}\} = \{2,4,6\} \\ \Rightarrow \;\; &P(E) = P(\{2\})+P(\{4\})+P(\{6\}) = \frac{1}{2} \end{aligned}

Properties of probability:

  1. P(E) + P(E^c) = 1
  2. P(\emptyset)=0
  3. E_1\subseteq E_2 \Rightarrow P(E_1)\leq P(E_2)
  4. P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2), where E_1 \cap E_2 means both E_1 and E_2 happen

1.2.2. Remark: why do we need the notion of event?

If the sample space \Omega is discrete (has at most countably many elements), then everything can be built from the "atoms":

\begin{aligned} &\Omega = \{\omega_1, \omega_2, \ldots\} \\ &P(\{\omega_i\}) = p_i, \quad p_i \in [0, 1], \quad \sum_{i=1}^\infty p_i = 1 \end{aligned}

Then for any event E=\{\omega_i, i \in I\}, P(E) = \sum_{i \in I}p_i

However, if the sample space \Omega is continuous, e.g., [0,1] in Example 4, then such a construction cannot be done: for any x\in [0, 1] we get P(\{x\}) = 0 (\{x\}: the outcome is exactly x)

We cannot get P([0, \frac{1}{3}]) by adding P(\{x\}) for x\leq \frac{1}{3}.

This is why we need the notion of event, and we define P as a set function from \xi to R rather than a function from \Omega to R

To summarize: a probability space consists of a triplet (\Omega, \xi, P): the sample space, the collection of events, and the probability.

1.3. Conditional Probability

If we know some information, the probability of an event can be updated

Let E, F be two events with P(F) > 0.

The conditional probability of E given F is

P(E\mid F) = \frac{P(E \cap F)}{P(F)}

Again, think of probability as the long-run frequency:

\begin{aligned} P(E \cap F) &= \lim_{n\rightarrow\infty}\frac{\#\ \text{of times } E \text{ and } F \text{ happen in } n \text{ trials}}{n} \\ P(F) &= \lim_{n\rightarrow\infty}\frac{\#\ \text{of times } F \text{ happens in } n \text{ trials}}{n} \\ \Rightarrow \frac{P(E\cap F)}{P(F)} &= \lim_{n\rightarrow\infty}\frac{\#\ \text{of times } E \text{ and } F \text{ happen}}{\#\ \text{of times } F \text{ happens}} \end{aligned}

By definition

P(E\cap F) = P(E\mid F) \cdot P(F)
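The frequency interpretation can be illustrated with a short simulation, a sketch using the die example: take E = \{even\} and F = \{4,5,6\}, so P(E|F) = 2/3.

```python
import random

# Sketch: estimate P(E|F) as a ratio of long-run frequencies for a fair die,
# with E = {even number} and F = {4, 5, 6}; here P(E|F) = 2/3.
random.seed(1)
rolls = [random.randint(1, 6) for _ in range(200_000)]

p_ef = sum(r % 2 == 0 and r > 3 for r in rolls) / len(rolls)  # P(E and F)
p_f = sum(r > 3 for r in rolls) / len(rolls)                  # P(F)
print(p_ef / p_f)  # close to 2/3
```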

1.4. Independence

Def: Two events E and F are said to be independent if P(E\cap F)=P(E)\cdot P(F); denoted as E\perp\!\!\!\perp F. This is different from disjoint.

Assume P(F)>0; then E\perp\!\!\!\perp F \Leftrightarrow P(E|F)=P(E); intuitively, knowing F does not change the probability of E.

Proof:

\begin{aligned} E\perp\!\!\!\perp F & \Leftrightarrow P(E\cap F) = P(E)\cdot P(F) \\ & \Leftrightarrow \frac{P(E\cap F)}{P(F)} = P(E) \\ & \Leftrightarrow P(E|F) = P(E) \end{aligned}

More generally, a sequence of events E_1, E_2, \ldots are called independent if for any finite index set I,

P(\bigcap_{i\in I}E_i)=\prod_{i\in I} P(E_i)

1.5. Bayes' rule and law of total probability

Theorem: Let F_1, F_2,\ldots be disjoint events with \bigcup_{i=1}^\infty F_i=\Omega; we say \{F_i\}_{i=1}^\infty forms a "partition" of the sample space \Omega

Then P(E)=\sum_{i=1}^\infty P(E|F_i)\cdot P(F_i)

Proof: Exercise

Intuition: Decompose the total probability into different cases.

P(E\cap F_2) = P(E|F_2)\cdot P(F_2)

1.5.1. Bayes' rule

P(F_i | E) = \frac{P(E|F_i)\cdot P(F_i)}{\sum_{j=1}^\infty P(E|F_j)\cdot P(F_j)}

Bayes' rule tells us how to find conditional probability by switching the role of the event and the condition.

Proof:

\begin{aligned} P(F_i | E) & = \frac{P(F_i\cap E)}{P(E)} & \hspace{1em}\text{definition of conditional probability} \\ & = \frac{P(E|F_i)P(F_i)}{P(E)} \\ & = \frac{P(E|F_i)P(F_i)}{\sum_{j=1}^\infty P(E|F_j)P(F_j)} & \text{law of total probability} \end{aligned}
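A small numeric sketch of the law of total probability and Bayes' rule, with made-up numbers for P(F_i) and P(E|F_i):

```python
# Sketch with made-up numbers: a partition {F1, F2} and likelihoods P(E|Fi).
priors = {"F1": 0.3, "F2": 0.7}        # P(Fi); must sum to 1
likelihoods = {"F1": 0.9, "F2": 0.2}   # P(E|Fi)

# Law of total probability: P(E) = sum_i P(E|Fi) * P(Fi)
p_e = sum(likelihoods[f] * priors[f] for f in priors)

# Bayes' rule: P(F1|E) = P(E|F1) * P(F1) / P(E)
posterior_f1 = likelihoods["F1"] * priors["F1"] / p_e
print(p_e, posterior_f1)
```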

2. Random variables and distributions

2.1. Random variables

(\Omega,\xi, P): probability space.

Definition: A random variable X (or r.v.) is a mapping from \Omega to \R

\begin{aligned} X : & \;\Omega\rightarrow \R \\ & \;\omega \mapsto X(\omega) \end{aligned}

A random variable transforms arbitrary "outcomes" into numbers.

X induces a probability on \R. For A\subseteq \R, define

\begin{aligned} P(X\in A) :&= P(\{X(\omega)\in A\}) \\ &= P(\{\omega:X(\omega)\in A\}) \\ &= P(X^{-1}(A)) \end{aligned}

From now on, we can often "forget" the original probability space and focus on the random variables and their distributions.

Definition: Let X be a random variable. The cdf (cumulative distribution function) F of X is defined by

F(x) = P(X\leq x) = P(X\in(-\infty, x]) \quad\quad X: \text{random variable}, \; x: \text{number}

Properties of cdf:

  1. F is non-decreasing: F(x_1)\leq F(x_2) for x_1 < x_2
  2. Limits: \lim_{x\rightarrow -\infty}F(x) = 0 and \lim_{x\rightarrow \infty}F(x) = 1
  3. F(x) is right continuous

2.2. Discrete random variables and distributions

A random variable X is called discrete if it only takes values in an at most countable set \{x_1, x_2, \ldots\} (finite or countable).

The distribution of a discrete random variable is fully characterized by its probability mass function (pmf)

p(x):=P(X=x); \quad x=x_1, x_2, \ldots

Properties of pmf:

  1. p(x)\geq 0, \;\;\forall x
  2. \sum_i p(x_i) = 1

Q: what does the cdf of a discrete random variable look like?

2.2.1. Examples of discrete distributions

1. Bernoulli distribution

\begin{aligned} p(1) &= P(X=1)=p \\ p(0) &= P(X=0) = 1-p \\ p(x) &= 0 \quad \text{otherwise} \end{aligned}

Denote X\sim Ber(p)

2. Binomial distribution

p(k) = P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}, \quad k=0,1,\ldots,n; \quad \text{denote } X\sim Bin(n,p)

3. Geometric distribution

p(k) = P(X=k)=(1-p)^{k-1}p, \quad k=1,2,\ldots; \quad \text{denote } X\sim Geo(p)

(1-p)^{k-1}: \text{the first } k-1 \text{ trials are all failures}, \;\; p: \text{success in the } k^{th} \text{ trial}

Memoryless property:

P(X>n+m\mid X>m)=P(X>n)

Proof:

\begin{aligned} P(X>k) & = \sum_{j=k+1}^\infty P(X=j) \\ & = \sum_{j=k+1}^\infty (1-p)^{j-1}p \\ & = (1-p)^kp \cdot \frac{1}{1-(1-p)} \\ & = (1-p)^k \end{aligned}

\begin{aligned} P(X>n+m\mid X>m) &= \frac{P(\{X>n+m\} \cap \{X>m\})}{P(X>m)} \\ &= \frac{P(X>n+m)}{P(X>m)} = \frac{(1-p)^{n+m}}{(1-p)^m} = (1-p)^n=P(X>n) \end{aligned}

Intuition: The failures in the past have no influence on how long we still need to wait to get the first success in the future
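The memoryless property can also be checked by simulation; the sketch below uses an assumed helper `geometric` that counts trials up to and including the first success.

```python
import random

# Sketch: check P(X > n+m | X > m) = P(X > n) for X ~ Geo(p) by simulation.
random.seed(2)
p, n, m = 0.3, 2, 3

def geometric(p):
    """Trials up to and including the first success, each succeeding w.p. p."""
    k = 1
    while random.random() >= p:  # failure with probability 1 - p
        k += 1
    return k

samples = [geometric(p) for _ in range(200_000)]
beyond_m = [x for x in samples if x > m]
cond = sum(x > n + m for x in beyond_m) / len(beyond_m)  # P(X>n+m | X>m)
uncond = sum(x > n for x in samples) / len(samples)      # P(X>n)
print(cond, uncond)  # both close to (1-p)**n = 0.49
```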

4. Poisson distribution

p(k)=P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}, \quad k=0,1,2,\ldots, \quad \lambda > 0

Other discrete distributions:

2.3. Continuous random variables and distributions

Definition: A random variable X is called continuous if there exists a non-negative function f, such that for any interval [a,b], (a,b), or [a,b):

P(X\in[a,b]) = \int_a^b f(x)dx

The function f is called the probability density function (pdf) of X

Remark: the probability density function (pdf) is not a probability. P(X=x)=0 if X is continuous. The pdf f only gives probability when it is integrated.

If X is continuous, then we can get the cdf by:

F(a)=P(X\in(-\infty, a])=\int^a_{-\infty} f(x)dx

Hence, F(x) is continuous, and differentiable "almost everywhere".

We can take f(x)=F'(x) when the derivative exists, and let f(x) be an arbitrary number otherwise; often we choose a value that makes f have some continuity.

Property of pdf:

  1. f(x)\geq 0, \; x\in \R
  2. \int_{-\infty}^\infty f(x)dx = 1
  3. For A\subseteq \R, P(X\in A)=\int_A f(x)dx

2.3.1. Example of continuous distribution

Exponential distribution

f(x)= \begin{cases} \lambda e^{-\lambda x}, & x\geq 0 \\ 0, & x < 0 \end{cases} \quad\quad X\sim Exp(\lambda)

Other continuous distributions:

Exercises:

  1. Find the cdf of X\sim Exp(\lambda)

    \begin{aligned} F(k) = P(X\leq k) &= \int_{-\infty}^k f(x)dx \\ &= \int_0^k \lambda e^{-\lambda x}dx \\ &= -e^{-\lambda x} \Big|^k_0 \\ &= -e^{-\lambda k} - (-e^0) \\ &= 1 - e^{-\lambda k} \end{aligned}

  2. Show that the exponential distribution has the memoryless property:

    P(X>t+s|X>t)=P(X>s)
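As a numerical sketch of Exercise 1, one can compare the empirical cdf of simulated Exp(\lambda) samples with the closed form 1 - e^{-\lambda k} (`random.expovariate(lam)` draws exponential samples with rate lam):

```python
import math
import random

# Sketch: empirical cdf of Exp(lam) vs the closed form F(k) = 1 - exp(-lam*k).
random.seed(3)
lam, k = 2.0, 0.5

samples = [random.expovariate(lam) for _ in range(200_000)]
empirical = sum(x <= k for x in samples) / len(samples)
closed_form = 1 - math.exp(-lam * k)
print(empirical, closed_form)  # both close to 0.632...
```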

2.4. Joint distribution of r.v's

Let X and Y be two r.v's defined on the same probability space (\Omega, \xi, P)

For each \omega \in \Omega, we have at the same time X(\omega) and Y(\omega). Then we can talk about the joint behavior of X and Y

The joint distribution of r.v's is characterized by the joint cdf, joint pmf (discrete case), or joint pdf (continuous case).

Definition: Two r.v's X and Y are called independent if, for all sets A,B\subseteq R,

P(X\in A, Y\in B)=P(X\in A)\cdot P(Y\in B)

(\{X\in A\} and \{Y\in B\} are independent events)

Theorem: For two r.v's X and Y, the following are equivalent:

  1. X and Y are independent
  2. F(x,y)=F_x(x)F_y(y) for all x,y\in R, where F_x is the cdf of X and F_y is the cdf of Y
  3. f(x,y)=f_x(x)f_y(y) for all x,y\in R, where f is the joint pmf/pdf of X and Y, and f_x, f_y are the marginal pmf's/pdf's of X and Y, respectively

Proof:

1.\Rightarrow 2.

If X \perp\!\!\!\perp Y, then by definition,

F(x,y)=P(X\in(-\infty, x],\, Y\in(-\infty, y]) = P(X\in(-\infty, x])\cdot P(Y\in(-\infty,y]) = F_x(x)F_y(y)

2.\Rightarrow 3.

Assume F(x,y)=F_x(x)\cdot F_y(y),

\begin{aligned} f(x,y)=\frac{\partial^2}{\partial x \partial y}F(x,y) & = \frac{\partial^2}{\partial x \partial y} F_x(x)F_y(y) \\ &= \left(\frac{\partial}{\partial x}F_x(x)\right)\left(\frac{\partial}{\partial y}F_y(y)\right) \\ &= f_x(x)f_y(y) \end{aligned}

3.\Rightarrow 1.

Assume f(x,y)=f_x(x)f_y(y); for A,B\subseteq R,

\begin{aligned} P(X\in A, Y\in B) &= \int_{y\in B}\int_{x\in A} f(x,y) dxdy \\ &= \int_{y\in B}\int_{x\in A} f_x(x)f_y(y)dxdy \\ &= \left(\int_{x\in A} f_x(x)dx\right) \left(\int_{y\in B} f_y(y)dy\right) \\ &= P(X\in A)P(Y\in B) \end{aligned}

2.5. Expectation

Definition: For a r.v. X, the expectation of g(X) is defined as

\mathbb{E}(g(X))= \begin{cases} \sum_{i=1}^\infty g(x_i)P(X=x_i) & \text{for discrete } X \\ \int_{-\infty}^\infty g(x)f(x)dx & \text{for continuous } X \end{cases}

Let X, Y be two r.v's; then the expectation of g(X,Y) is defined in a similar way.

\mathbb{E}(g(X,Y)) = \begin{cases} \sum_i\sum_j g(x_i,y_j)P(X=x_i, Y=y_j) & \text{discrete case} \\ \int\int g(x, y)f(x,y)dxdy & \text{continuous case} \end{cases}

2.5.1. Properties of expectation

  1. Taking g(X)=x gives the expectation of X: \mathbb{E}(X)= \begin{cases} \sum x_i P(X=x_i) \\ \int_{-\infty}^{\infty} xf(x)dx \end{cases}. Linearity: \mathbb{E}(aX+bY)=a\mathbb{E}(X)+b\mathbb{E}(Y)
  2. If X\perp \!\!\! \perp Y, then \mathbb{E}(g(X)h(Y))=\mathbb{E}(g(X))\cdot \mathbb{E}(h(Y))

2.5.2. Definitions

Definition: The expectation \mathbb{E}(X^n) is called the n-th moment of X.

Definition: The variance of a r.v. X is defined as:

Var(X)=\mathbb{E}((X-\mathbb{E}(X))^2), \text{ also denoted as } \sigma^2 \text{ or } \sigma_X^2

Definition: The covariance of the r.v's X and Y is defined as:

Cov(X,Y)=\mathbb{E}[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))]

Thus Var(X)=Cov(X,X)

Definition: The correlation between X and Y is defined as:

Cor(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}

Fact: Var(X)=\mathbb{E}(X^2)-(\mathbb{E}(X))^2

Proof:

\begin{aligned} Var(X) &= \mathbb{E}((X-\mathbb{E}(X))^2) \\ &= \mathbb{E}(X^2-2X\mathbb{E}(X)+(\mathbb{E}(X))^2) \\ &= \mathbb{E}(X^2)-2\mathbb{E}(X\mathbb{E}(X))+(\mathbb{E}(X))^2 \\ &= \mathbb{E}(X^2)-2(\mathbb{E}(X))^2+(\mathbb{E}(X))^2 \\ &= \mathbb{E}(X^2)-(\mathbb{E}(X))^2 \quad\quad\blacksquare \end{aligned}

Fact: Cov(X,Y)=\mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y)

Proof:

\begin{aligned} Cov(X,Y) &=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \\ &=\mathbb{E}[XY - X\mathbb{E}[Y]-Y\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y]] \\ &=\mathbb{E}[XY] - \mathbb{E}[X\mathbb{E}[Y]] - \mathbb{E}[Y\mathbb{E}[X]] + \mathbb{E}[X]\mathbb{E}[Y] \\ &=\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[Y]\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y] \\ &=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \quad\quad\blacksquare \end{aligned}

Variance and covariance are translation invariant. Variance is quadratic, covariance is bilinear.

Var(aX+b)=a^2\cdot Var(X)

Cov(aX+b, cY+d)=ac\cdot Cov(X,Y)

Proof: Var(aX+b)=a^2\cdot Var(X)

\begin{aligned} Var(aX+b) &= \mathbb{E}((aX+b)^2)-(\mathbb{E}(aX+b))^2 \\ &= \mathbb{E}(a^2X^2 + 2abX + b^2) - (a\mathbb{E}(X)+b)^2 \\ &= a^2\mathbb{E}(X^2) + 2ab\mathbb{E}(X)+b^2 - a^2\mathbb{E}^2(X) - 2ab\mathbb{E}(X) - b^2 \\ &= a^2\mathbb{E}(X^2)-a^2\mathbb{E}^2(X) \\ &= a^2 Var(X) \quad\quad\blacksquare \end{aligned}

Proof: Var(X+Y) = Var(X)+Var(Y)+2Cov(X,Y)

\begin{aligned} Var(X+Y) &= \mathbb{E}[(X+Y)^2] - \mathbb{E}^2[X+Y] \\ &=\mathbb{E}[X^2 + 2XY + Y^2] - (\mathbb{E}[X]+\mathbb{E}[Y])^2 \\ &=\mathbb{E}[X^2] + 2\mathbb{E}[XY] + \mathbb{E}[Y^2] - \mathbb{E}^2[X] - 2\mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}^2[Y] \\ &=(\mathbb{E}[X^2]-\mathbb{E}^2[X]) + (\mathbb{E}[Y^2]-\mathbb{E}^2[Y]) + 2(\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]) \\ &=Var(X) + Var(Y) + 2 Cov(X,Y) \quad\quad\blacksquare \end{aligned}

If X\perp \!\!\! \perp Y, then Cov(X,Y)=0 and Var(X+Y)=Var(X)+Var(Y)

Proof:

\begin{aligned} & Cov(X,Y)=\mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y) \\ & X \perp\!\!\!\perp Y\Rightarrow \mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y) \Rightarrow Cov(X,Y)=0 \\ & \Rightarrow Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)=Var(X)+Var(Y) \\ & \text{So independence} \Rightarrow Cov(X,Y)=0 \text{ ("uncorrelated"), but the converse is not true:} \\ & Cov(X,Y)=0 \;\cancel{\Rightarrow}\; \text{independence} \end{aligned}

Remarks

We have \mathbb{E}(X+Y)=\mathbb{E}(X)+\mathbb{E}(Y).

If X\perp \!\!\! \perp Y, we also have \mathbb{E}(XY)=\mathbb{E}(X)\mathbb{E}(Y) and Var(X+Y)=Var(X)+Var(Y).

It's important to remember that the first result and the other two results are of very different natures. While \mathbb{E}(X+Y)=\mathbb{E}(X)+\mathbb{E}(Y) is a property of expectation and holds unconditionally,

the other two, \mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y) and Var(X+Y)=Var(X)+Var(Y), only hold if X\perp \!\!\! \perp Y.

It is more appropriate to consider them as properties of independence rather than properties of expectation and variance

2.6. Indicator

A random variable I is called an indicator if, for some event A,

I(\omega)= \begin{cases} 1 & \omega\in A \\ 0 & \omega\notin A \end{cases}

For A given, I is also denoted as I_A.

The most important property of an indicator is that its expectation gives the probability of the event: \mathbb{E}(I_A)=\mathbb{P}(A)

Proof:

\begin{aligned} \mathbb{P}(I_A=1) &= \mathbb{P}(\{\omega:I_A(\omega)=1\}) \\ &= \mathbb{P}(\{\omega:\omega\in A\}) \\ &= \mathbb{P}(A) \end{aligned}

\mathbb{P}(I_A=0) = 1- \mathbb{P}(A) \Rightarrow \mathbb{E}(I_A)=1\cdot \mathbb{P}(A)+0\cdot (1-\mathbb{P}(A)) = \mathbb{P}(A)

2.6.1. Example

We see I_A\sim Ber(\mathbb{P}(A))

Let X\sim Bin(n,p); X is the number of successes in n Bernoulli trials, each with probability p of success

\Rightarrow X=I_1+\cdots+I_n

where I_1,\cdots,I_n are indicators for independent events: I_i=1 if the i-th trial is a success, and I_i=0 if the i-th trial is a failure.

Hence the I_i are iid (independent and identically distributed) r.v's

\begin{aligned} \Rightarrow \mathbb{E}(X) &= \mathbb{E}(I_1+\cdots+I_n) \\ &= \mathbb{E}(I_1)+\cdots+\mathbb{E}(I_n) \\ &= p + \cdots + p = n\cdot p \end{aligned}

\begin{aligned} Var(X) &= Var(I_1+\cdots+I_n) \\ &= Var(I_1)+\cdots+Var(I_n) \\ &= n\cdot Var(I_i) \\ &= n\cdot p(1-p) \end{aligned}

Var(I_1)=\mathbb{E}(I_1^2)-(\mathbb{E}(I_1))^2 = \mathbb{E}(I_1)-(\mathbb{E}(I_1))^2=p-p^2=p(1-p)
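The indicator decomposition, \mathbb{E}(X)=np and Var(X)=np(1-p), can be verified with a quick simulation sketch:

```python
import random

# Sketch: X = I_1 + ... + I_n with iid Bernoulli(p) indicators, so
# E(X) = n*p and Var(X) = n*p*(1-p); check both by simulation.
random.seed(4)
n, p, reps = 10, 0.4, 100_000

xs = [sum(random.random() < p for _ in range(n)) for _ in range(reps)]
mean = sum(xs) / reps
var = sum((x - mean) ** 2 for x in xs) / reps
print(mean, var)  # close to n*p = 4.0 and n*p*(1-p) = 2.4
```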

2.6.2. Example 3

Let X be a r.v. taking values in the non-negative integers; then

\mathbb{E}(X) = \sum_{n=0}^\infty P(X>n)

Proof:

Note that X=\sum_{n=0}^\infty I_n where I_n = I_{\{X>n\}}. (\{X>n\} is an event)

\begin{aligned} \mathbb{E}(X) &= \mathbb{E}\left(\sum_{n=0}^\infty I_n\right) \\ &= \sum_{n=0}^\infty \mathbb{E}(I_n) \\ &= \sum_{n=0}^\infty P(X>n) \end{aligned}

In particular, let X\sim Geo(p). As we have seen, P(X>n)=(1-p)^n \Rightarrow

\begin{aligned} \mathbb{E}(X) &= \sum_{n=0}^\infty P(X>n) \\ &= \sum_{n=0}^\infty (1-p)^n \\ &= \frac{1}{1-(1-p)} = \frac{1}{p} \end{aligned}
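The tail-sum computation for the geometric distribution can be reproduced numerically (a sketch; the infinite series is truncated at 1000 terms):

```python
# Sketch: tail-sum formula E(X) = sum_{n>=0} P(X>n) for X ~ Geo(p),
# using P(X>n) = (1-p)**n; the truncated sum approaches 1/p.
p = 0.25
tail_sum = sum((1 - p) ** n for n in range(1000))
print(tail_sum)  # close to 1/p = 4.0
```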

2.7. Moment generating function

Definition: Let X be a r.v. Then the function M(t)=\mathbb{E}(e^{tX}) is called the moment generating function (mgf) of X, if the expectation exists for all t\in (-h, h) for some h>0.

Remark: The mgf is not always well-defined. It is important to check the existence of the expectation.

2.7.1. Properties of mgf

  1. The mgf generates moments: \frac{d^n}{dt^n}M(t)\Big|_{t=0} = \mathbb{E}(X^n)
  2. X\perp \!\!\! \perp Y, with mgf's M_X, M_Y. Let M_{X+Y} be the mgf of X+Y; then

    M_{X+Y}(t)=M_X(t)M_Y(t)

    \begin{aligned} M_{X+Y}(t) &= \mathbb{E}(e^{t(X+Y)}) \\ &= \mathbb{E}(e^{tX}e^{tY}) \\ &= \mathbb{E}(e^{tX})\mathbb{E}(e^{tY}) \\ &= M_X(t)M_Y(t) \end{aligned}

  3. The mgf completely determines the distribution of a r.v.

2.7.2. Joint mgf

Definition: Let X, Y be r.v's. Then M(t_1, t_2):=\mathbb{E}(e^{t_1X+t_2Y}) is called the joint mgf of X and Y, if the expectation exists for all t_1\in(-h_1, h_1), t_2\in(-h_2, h_2) for some h_1, h_2 >0.

More generally, we can define M(t_1,\ldots, t_n)=\mathbb{E}(\exp(\sum_{i=1}^n t_iX_i)) for r.v's X_1, \cdots,X_n, if the expectation exists for \{(t_1,\cdots, t_n):t_i\in(-h_i,h_i), i=1,\cdots,n\} for some h_i>0, i=1,\cdots, n

2.7.2.1. Properties of the joint mgf

  1. \begin{aligned} M_X(t) &= \mathbb{E}(e^{tX}) \\ &=\mathbb{E}(e^{tX+0Y}) \\ &=M(t,0) \\ M_Y(t) &=M(0, t) \end{aligned}

  2. \frac{\partial^{m+n}}{\partial t_1^m \partial t_2^n} M(t_1, t_2)\Big|_{(0,0)} = \mathbb{E}(X^mY^n) \quad\quad \text{(the proof is similar to the single r.v. case)}

  3. If X\perp \!\!\! \perp Y, then M(t_1, t_2)=M_X(t_1)M_Y(t_2)

3. Conditional distribution and conditional expectation

3.1. Conditional distribution

3.1.1. Discrete case

Definition: Let X and Y be discrete r.v's. The conditional distribution of X given Y is given by:

P(X=x \mid Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}

P(X=x|Y=y) \text{ is also written as } f_{X|Y=y}(x) \text{ or } f_{X|Y}(x|y) \leftarrow \text{conditional probability mass function}

The conditional pmf is a legitimate pmf: given any y, f_{X|Y=y}(x) \geq 0, \forall x

\sum_x f_{X|Y=y}(x) = 1

Note that given Y=y, as x changes, the value of the function f_{X|Y=y}(x) is proportional to the joint probability.

f_{X|Y=y}(x) \propto P(X=x, Y=y)

This is useful for solving problems where the denominator P(Y=y) is hard to find.

3.1.1.1. Example

X_1\sim Poi(\lambda_1), X_2\sim Poi(\lambda_2), X_1\perp\!\!\!\perp X_2, Y=X_1 + X_2

Q: P(X_1=k|Y=n)?

Note P(X_1=k|Y=n) = f_{X_1|Y=n}(k)

A: P(X_1=k|Y=n) can only be non-zero for k=0, \cdots, n in this case,

\begin{aligned} P(X_1=k|Y=n) &= \frac{P(X_1=k, Y=n)}{P(Y=n)} \\ & \propto P(X_1=k, Y=n) \\ & = P(X_1=k, X_2=n-k) \\ & = e^{-\lambda_1}\frac{\lambda_1^k}{k!}\cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} \\ & \propto \left(\frac{\lambda_1}{\lambda_2}\right)^k \Big/\, k!(n-k)! \end{aligned}

We can get P(X_1=k|Y=n) by normalizing the above expression:

P(X_1=k \mid Y=n) = \frac{(\frac{\lambda_1}{\lambda_2})^k / k!(n-k)!}{\sum_{k=0}^n(\frac{\lambda_1}{\lambda_2})^k/k!(n-k)!}

but then we will need to find \sum_{k=0}^n(\frac{\lambda_1}{\lambda_2})^k/k!(n-k)!

An easier way is to compare (\frac{\lambda_1}{\lambda_2})^k/k!(n-k)! with the known results for common distributions. In particular, if X\sim Bin(n, p)

\begin{aligned} P(X=k) &= \binom{n}{k}p^k(1-p)^{n-k} \\ &\propto \left(\frac{p}{1-p}\right)^k \Big/\, k!(n-k)! \end{aligned}

\Rightarrow X_1 \mid Y=n follows a binomial distribution with parameters n and p given by \frac{p}{1-p}=\frac{\lambda_1}{\lambda_2} \Rightarrow p=\frac{\lambda_1}{\lambda_1+\lambda_2}

Thus, given Y=X_1+X_2=n, the conditional distribution of X_1 is binomial with parameters n and \frac{\lambda_1}{\lambda_1+\lambda_2}
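A simulation sketch of this result; the `poisson` sampler below is a hypothetical helper using cdf inversion, and the parameter values are made up.

```python
import math
import random

# Sketch: with X1 ~ Poi(lam1), X2 ~ Poi(lam2) independent and Y = X1 + X2,
# X1 | Y = n should be Bin(n, lam1/(lam1+lam2)); compare conditional means.
random.seed(5)
lam1, lam2, n = 2.0, 3.0, 5

def poisson(lam):
    """Poisson sample via inversion of the cdf."""
    u, k, term = random.random(), 0, math.exp(-lam)
    total = term
    while u > total:
        k += 1
        term *= lam / k
        total += term
    return k

pairs = [(poisson(lam1), poisson(lam2)) for _ in range(300_000)]
x1_given_y = [x1 for x1, x2 in pairs if x1 + x2 == n]
cond_mean = sum(x1_given_y) / len(x1_given_y)
print(cond_mean)  # close to n * lam1/(lam1+lam2) = 2.0
```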

3.1.2. Continuous case

Definition: Let X and Y be continuous r.v's. The conditional distribution of X given Y is given by

f_{X|Y}(x|y) = f_{X|Y=y}(x) = \frac{f(x,y)}{f_Y(y)}

A conditional pdf is a legitimate pdf:

\begin{aligned} f_{X|Y}(x|y) \geq 0,\quad\quad & x,y\in \R\\ \int_{-\infty}^\infty f_{X|Y}(x|y) dx = 1,\quad &y\in \R \end{aligned}

3.1.2.1. Example

Suppose X\sim Exp(\lambda) and Y|X=x \sim Exp(x), i.e., f_{Y|X}(y|x) = x e^{-xy}, \; y > 0 \leftarrow conditional distribution of Y given X=x

Q: Find the conditional pdf f_{X|Y}(x|y)

A:

\begin{aligned} f_{X|Y}(x|y) &= \frac{f(x,y)}{f_Y(y)} \\ & \propto f(x,y) \\ & = f_{Y|X}(y|x)\cdot f_X(x) \\ & = xe^{-xy}\lambda e^{-\lambda x} \\ & \propto xe^{-x(y+\lambda)}, \quad\quad x>0, y>0 \end{aligned}

Normalization (make the total probability 1):

\begin{aligned} f_{X|Y}(x|y) & = \frac{xe^{-x(y+\lambda)}}{\int_0^\infty xe^{-x(y+\lambda)}dx} \\ \int_0^\infty xe^{-x(y+\lambda)}dx &= \left(\frac{1}{\lambda + y}\right)^2 \leftarrow \text{integration by parts} \end{aligned}

Thus, f_{X|Y}(x|y) = (\lambda + y)^2xe^{-x(y+\lambda)}, x>0.

This is a gamma distribution with parameters 2 and \lambda + y

3.1.2.2. Example 2

Find the distribution of Z=XY (with X and Y as in the previous example).

Attention: the following method is wrong:

f_Z(z) = \int_0^\infty f_{Y|X}\left(\frac{z}{x}\Big|x\right)\cdot f_X(x)dx

If we want to work directly with pdf's, we need the change-of-variables formula for multiple variables. The right formula turns out to be

\begin{aligned} f_Z(z) & = \int_0^\infty f_{X,Z}(x,z)dx = \int_0^\infty f_{Z|X}(z|x) f_X(x)dx \\ &= \int_0^{\infty} f\left(x, \frac{z}{x}\right)\cdot \frac{1}{x}dx \\ &= \int_0^\infty f_{Y|X}\left(\frac{z}{x}\Big|x\right) f_X(x)\cdot\frac{1}{x}dx \end{aligned}

An easier way is to use the cdf, which gives probability rather than density:

\begin{aligned} P(Z\leq z) & = P(XY\leq z) \\ & = \int_0^\infty P(XY\leq z|X=x) f_X(x) dx \quad\quad (\text{law of total probability}) \\ &= \int_0^\infty P\left(Y\leq\frac{z}{x}\Big|X=x\right)\cdot f_X(x)dx \quad\quad (Y|X=x \sim Exp(x)) \\ &= \int_0^\infty\left(1-e^{-x\cdot\frac{z}{x}}\right)\cdot\lambda e^{-\lambda x} dx \\ &= 1 - e^{-z}\int_0^\infty \lambda e^{-\lambda x}dx \\ &= 1-e^{-z} \Rightarrow Z\sim Exp(1) \end{aligned}
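The conclusion Z = XY \sim Exp(1) can be checked by simulation; a sketch with an assumed \lambda = 2 (any \lambda > 0 should give the same limit):

```python
import math
import random

# Sketch: X ~ Exp(lam), Y | X = x ~ Exp(x); then Z = X*Y should be Exp(1),
# i.e. P(Z <= z) = 1 - exp(-z), regardless of lam.
random.seed(6)
lam, z = 2.0, 1.0

def sample_z():
    x = random.expovariate(lam)
    y = random.expovariate(x)  # conditional rate is the realized value of X
    return x * y

zs = [sample_z() for _ in range(200_000)]
empirical = sum(v <= z for v in zs) / len(zs)
print(empirical, 1 - math.exp(-z))  # both close to 0.632
```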

Notation: X,Y|\{ Z = k \}\stackrel{iid}{\sim}\cdots means that given Z=k, X and Y are conditionally independent, and they follow the given distribution.

(the conditional joint cdf/pmf/pdf equals the product of the conditional cdf's/pmf's/pdf's)

3.2. Conditional expectation

We have seen that conditional pmf's/pdf's are legitimate pmf's/pdf's. Correspondingly, a conditional distribution is nothing else but a probability distribution. It is simply a (potentially) different distribution, since it takes more information into consideration.

As a result, we can define everything which was previously defined for unconditional distributions also for conditional distributions.

In particular, it is natural to define the conditional expectation.

Definition: The conditional expectation of g(X) given Y=y is defined as

\mathbb{E}(g(X)|Y=y) = \begin{cases} \sum_{i=1}^\infty g(x_i) P(X=x_i|Y=y) & \text{ if } X|Y=y \text{ is discrete} \\ \int_{-\infty}^\infty g(x)f_{X|Y}(x|y) dx & \text{ if } X|Y=y \text{ is continuous} \end{cases}

Fix y; the conditional expectation is nothing but the expectation taken under the conditional distribution.

3.2.1. What is \mathbb{E}(X|Y)?

Different ways to understand conditional expectation

  1. Fix a value y; \mathbb{E}(g(X)|Y=y) is a number
  2. As y changes, \mathbb{E}(g(X)|Y=y) becomes a function of y (each y gives a value): h(y) := \mathbb{E}(g(X)|Y=y)
  3. Since Y is actually random, we can define \mathbb{E}(g(X)|Y)=h(Y). This is a random variable:

    \mathbb{E}(g(X)|Y)(\omega) = \mathbb{E}(g(X)|Y=Y(\omega)), \quad \omega \in \Omega

    This random variable takes value \mathbb{E}(g(X)|Y=y) when Y=y

    h(Y): \Omega \rightarrow \R, \quad h(Y)(\omega) = h(Y(\omega))

3.2.2. Properties of conditional expectation

  1. Linearity (inherited from expectation)

    \mathbb{E}(aX+b | Y = y) = a\mathbb{E}(X|Y=y) +b

    \mathbb{E}(X+Z|Y=y) = \mathbb{E}(X|Y=y)+\mathbb{E}(Z|Y=y)

    1. \mathbb{E}(g(X,Y)|Y=y) = \mathbb{E}(g(X,y)|Y=y) \neq \mathbb{E}(g(X,y)) \text{ in general, when } X \text{ and } Y \text{ are not independent}

    Proof (Discrete):

    \mathbb{E}(g(X,Y)|Y=y) =\sum_{x_i}\sum_{y_j}g(x_i,y_j)\cdot P(X=x_i,Y=y_j|Y=y) \\ \\ P(X=x_i,Y=y_j|Y=y) = \begin{cases} 0 & \text{ if } y_j \neq y \\ P(X=x_i,Y=y_j)/P(Y=y)=P(X=x_i|Y=y) & \text{ if } y_j=y \end{cases}\\ \begin{aligned} \\ \Rightarrow \mathbb{E}(g(X,Y)|Y=y) &= \sum_{x_i}g(x_i,y)\cdot P(X=x_i|Y=y) \\ &= \mathbb{E}(g(X,y)|Y=y) & g(X,y)\text{ regarded as a function of } X \end{aligned}

    In particular,

    \mathbb{E}(g(X)\cdot h(Y) | Y=y) = h(y) \mathbb{E}(g(X)|Y=y)

    \mathbb{E}(g(X)\cdot h(Y)|Y) = h(Y)\mathbb{E}(g(X)|Y)

  2. If X\perp \!\!\! \perp Y, then \mathbb{E}(g(X)|Y=y) = \mathbb{E}(g(X))

    Fact: If X\perp \!\!\! \perp Y, then the conditional distribution of X given Y=y is the same as the unconditional distribution of X

    Proof (Discrete):

    \begin{aligned} & \text{if } X\perp\!\!\!\perp Y \\ & P(X=x_i|Y=y_j) \\ &\quad= P(X=x_i,Y=y_j) / P(Y=y_j) \\ &\quad= P(X=x_i)P(Y=y_j)/P(Y=y_j) \\ &\quad=P(X=x_i) \end{aligned}

  3. Law of iterated expectation (or double expectation): the expectation of a conditional expectation is the unconditional expectation

    \mathbb{E}(\mathbb{E}(X|Y))=\mathbb{E}(X)

    \mathbb{E}(X|Y) is a r.v., a function of Y.
    Proof (Discrete):

    When Y=y_j, the r.v. \mathbb{E}(X|Y)=\mathbb{E}(X|Y=y_j) = \sum_{x_i}x_iP(X=x_i|Y=y_j). This happens with probability P(Y=y_j)

    \Rightarrow \begin{aligned} \mathbb{E}(\mathbb{E}(X|Y)) &= \sum_{y_j}\left(\sum_{x_i}x_iP(X=x_i|Y=y_j)\right)P(Y=y_j) \\ &= \sum_{x_i}\sum_{y_j} x_iP(X=x_i|Y=y_j)P(Y=y_j) \\ &= \sum_{x_i}x_i\sum_{y_j}P(X=x_i|Y=y_j)P(Y=y_j) &\text{law of total probability} \\ &= \sum_{x_i}x_i P(X=x_i) = \mathbb{E}(X) \end{aligned}

    Alternatively,

    xiyjxiP(X=xiY=yj)P(Y=yj)=xiyjxiP(X=xi,Y=yj)g(X,Y)=X at (xi,yj)=E(X)\begin{aligned} &\sum_{x_i}\sum_{y_j} x_i P(X=x_i|Y=y_j)P(Y=y_j) \\ &\quad=\sum_{x_i}\sum_{y_j} x_i P(X=x_i,Y=y_j) & g(X,Y)=X \text{ at } (x_i, y_j)\\ &\quad= \mathbb{E}(X) \end{aligned}

    Example:

    YY: # of claims received by insurance company
    XX: some random parameter

    YXPoi(X),XExp(λ)Y|X\sim Poi(X), X\sim Exp(\lambda)

    a) E(Y)\mathbb{E}(Y) ?
    b) P(Y=n)P(Y=n) ?

    a)

    YXPoi(X)E(YX=x)=xE(YX)=XE(Y)=E(E(YX))=E(X)=1λY|X\sim Poi(X) \Rightarrow \mathbb{E}(Y|X=x) = x \Rightarrow \mathbb{E}(Y|X)=X \\ \therefore \begin{aligned} \\ \mathbb{E}(Y) &= \mathbb{E}(\mathbb{E}(Y|X)) \\ &= \mathbb{E}(X) = \frac{1}{\lambda} \end{aligned}

    b)

    \begin{aligned} P(Y=n) &= \int_0^\infty P(Y=n|X=x)f_X(x)dx \\ &= \int_0^\infty \frac{e^{-x}x^n}{n!}\cdot \lambda e^{-\lambda x} dx \\ &= \frac{\lambda}{n!}\int_0^\infty x^n e^{-(\lambda+1)x}dx \\ &= \frac{\lambda}{(\lambda+1)^{n+1}n!}\int_0^\infty((\lambda+1)x)^n e^{-(\lambda+1)x}d(\lambda+1)x \\ &= \frac{\lambda}{(\lambda+1)^{n+1}n!}\Gamma(n+1) &\Gamma(n+1) = n! \text{ ; formula for gamma function or integration by parts} \\ &= \frac{\lambda}{(\lambda+1)^{n+1}} = \left(\frac{1}{\lambda+1}\right)^n\cdot\frac{\lambda}{\lambda+1} \\ &\Rightarrow Y+1\sim Geo(\lambda/(\lambda+1)) \end{aligned}

    We verify that E(Y)=λ+1λ1=1λ\mathbb{E}(Y)=\frac{\lambda +1}{\lambda}-1=\frac{1}{\lambda}
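This mixture calculation can be sanity-checked by simulation. The sketch below (Python) samples X from Exp(lam) and then Y | X = x from Poi(x), and compares the sample mean of Y with 1/lam and the sample frequency of {Y = 0} with lam/(lam+1); lam = 2 is an illustrative choice.

```python
import math
import random

# Monte Carlo check of the Poisson-Exponential mixture above.
lam = 2.0            # illustrative exponential rate
N = 100_000
rng = random.Random(1)

def poisson_sample(mean, rng):
    # Knuth's product-of-uniforms method; fine for the small means here
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

samples = []
for _ in range(N):
    x = rng.expovariate(lam)                # X ~ Exp(lam)
    samples.append(poisson_sample(x, rng))  # Y | X = x ~ Poi(x)

mean_Y = sum(samples) / N         # theory: E(Y) = 1/lam = 0.5
frac_zero = samples.count(0) / N  # theory: P(Y=0) = lam/(lam+1) = 2/3
```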

3.3. Decomposition of variance (EVVE's law)

Definition: The conditional variance of YY given X=xX=x is defined as

Var(YX=x)=E((YE(YX=x))2X=x)Var(Y|X=x)=\mathbb{E}((Y-\mathbb{E}(Y|X=x))^2|X=x)

Var(Y|X)(\omega) = Var(Y|X=X(\omega)) \quad \quad Var(Y|X) \text{: a r.v., a function of }X

The conditional variance is simply the variance taken under the conditional distribution

V(YX=x)=E(Y2X=x)(E(YX=x))2\Rightarrow V(Y|X=x)=\mathbb{E}(Y^2|X=x)-(\mathbb{E}(Y|X=x))^2

Thus

Var(Y)=E(Var(YX))+Var(E(YX))Var(Y)=\mathbb{E}(Var(Y|X))+Var(\mathbb{E}(Y|X))

E(Var(YX)):"intra-group variance" Var(E(YX)):"inter-group variance"\mathbb{E}(Var(Y|X)): \text{"intra-group variance" } \quad\quad Var(\mathbb{E}(Y|X)): \text{"inter-group variance"}

Proof:

\begin{aligned} RHS &= E(E(Y^2|X)-(E(Y|X))^2) + E((E(Y|X))^2) - (E(E(Y|X)))^2 \\ &=E(E(Y^2|X)) - \cancel{E((E(Y|X))^2)} + \cancel{E((E(Y|X))^2)} - (E(E(Y|X)))^2 \\ &=E(Y^2) - (E(Y))^2 \\ &=Var(Y) \end{aligned}
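The decomposition can also be checked numerically. The sketch below (Python) builds a small made-up joint pmf for (X, Y), whose probability values are arbitrary choices for illustration, and computes both sides of the identity.

```python
# Check Var(Y) = E(Var(Y|X)) + Var(E(Y|X)) on a small joint pmf.
joint = {  # (x, y) -> P(X=x, Y=y); illustrative probabilities
    (0, 1): 0.15, (0, 2): 0.25,
    (1, 1): 0.35, (1, 3): 0.25,
}

PX = {}
for (x, y), p in joint.items():
    PX[x] = PX.get(x, 0.0) + p                # marginal of X

EY  = sum(y * p for (_, y), p in joint.items())
EY2 = sum(y * y * p for (_, y), p in joint.items())
var_Y = EY2 - EY ** 2                         # LHS

intra = 0.0   # E(Var(Y|X)), "intra-group variance"
inter = 0.0   # Var(E(Y|X)), "inter-group variance"
for x0 in PX:
    cond = {y: p / PX[x0] for (x, y), p in joint.items() if x == x0}
    m1 = sum(y * q for y, q in cond.items())      # E(Y | X=x0)
    m2 = sum(y * y * q for y, q in cond.items())  # E(Y^2 | X=x0)
    intra += (m2 - m1 ** 2) * PX[x0]
    inter += m1 ** 2 * PX[x0]
inter -= EY ** 2   # Var(E(Y|X)), using E(E(Y|X)) = E(Y)
```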

4. Stochastic Processes

Stochastic Processes

  1. sequence / family of random variables
  2. a random function (hard to formulate)

Definition: A stochastic process {Xt}tT\{X_t\}_{t\in T} is a collection of random variables, defined on a common probability space.

TT: index set. In most cases, TT corresponds to time, and is either discrete {0,1,2,}\{0,1,2,\cdots\} or continuous [0,)[0,\infty)

In the discrete case, we write {Xn}n=0,1,2,\{X_n\}_{n=0,1,2,\ldots}

The state space SS of a stochastic process is the set of all possible values of Xt,tTX_t, t\in T

SS can also be either discrete or continuous. In this course, we typically deal with discrete state space. Then we relabel the states so that S={0,1,2,}S=\{0,1,2,\cdots\} (countable state space) or S={0,1,2,,M}S=\{0,1,2,\cdots,M\} (finite state space)

Remark: As in the case of the joint distribution, we need the r.v's in a stochastic process to be defined on a common probability space, because we want to discuss their joint behaviours, i.e., how things change over time.

[Figure: Stochastic Processes Graph]

Thus, we can identify each point ω\omega in the sample space Ω\Omega with a function defined on TT and taking value in SS. Each function is called a path of this stochastic process

Example Let X0,X1,X_0, X_1, \cdots be independent and identically distributed r.v.'s following some distribution. Then {Xn}n=0,1,2,...\{X_n\}_{n=0,1,2,...} is a stochastic process

[Figure: Stochastic Processes Example 1]

Example Let X1,X2,...X_1, X_2,... be independent and identically distributed r.v.'s with P(X1=1)=pP(X_1=1)=p and P(X1=1)=1pP(X_1=-1)=1-p. Define S0=0,Sn=i=1nXiS_0=0, S_n=\sum_{i=1}^n X_i for n1n\geq 1.

Then {Sn}n=0,1,...\{S_n\}_{n=0,1,...} is a stochastic process, with state space S=ZS=\Z (integer)

[Figure: Stochastic Processes Example 2]

4.1. Markov Chain

4.1.1. Simple Random Walk

{Sn}n=0,1,...\{S_n\}_{n=0,1,...} is called a "simple random walk". (Sn=Sn1+XnS_n=S_{n-1}+X_n)

S_n=\begin{cases} S_{n-1} + 1 & \text{w.p. } p \\ S_{n-1} - 1 & \text{w.p. } 1-p \end{cases}
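A minimal sketch of the walk in code (Python; p = 1/2 is just an illustrative choice):

```python
import random

# Simulate one path of the simple random walk S_n = S_{n-1} + X_n.
rng = random.Random(42)
p = 0.5            # illustrative success probability
n_steps = 1000

S = [0]                                    # S_0 = 0
for _ in range(n_steps):
    X = 1 if rng.random() < p else -1      # X_i = +1 w.p. p, -1 w.p. 1-p
    S.append(S[-1] + X)

increments = [S[k + 1] - S[k] for k in range(n_steps)]
```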

Remark: Why do we need the concept of a "stochastic process"? Why don't we just look at the joint distribution of (X0,X1,...,Xn)(X_0, X_1,...,X_n)?

Answer: The joint distribution of a large number of r.v.'s is very complicated, and it does not take advantage of the special structure of TT (time).

For example, consider the simple random walk. The full distribution of (S0,S1,...,Sn)(S_0, S_1, ..., S_n) is complicated for large nn. However, the structure is actually simple if we focus on adjacent times:

Sn+1=Sn+Xn+1S_{n+1}=S_n+X_{n+1}

Sn: last value. Xn+1:related to Ber(p). They are independentS_n: \text{ last value. } \quad X_{n+1}: \text{related to }Ber(p). \text{ They are independent}

By introducing time into the framework, we can greatly simplify many things.

More precisely, we find that for the simple random walk {Sn}n=0,1,...\{S_n\}_{n=0,1,...}, if we know SnS_n, the distribution of Sn+1S_{n+1} will not depend on the history (S0,...,Sn1)(S_0, ..., S_{n-1}). This is a very useful property

In general for a stochastic process {Xn}n=0,1,...\{X_n\}_{n=0,1,...}, at time nn we already know X0,X1,...,XnX_0, X_1,..., X_n; our best estimate of the distribution of Xn+1X_{n+1} should be the conditional distribution:

X_{n+1}|X_n,\ldots,X_0

given by:

P(Xn+1=xn+1Xn=xn,...,X0=x0)P(X_{n+1}=x_{n+1}|X_n=x_n,..., X_0=x_0)

As time passes, the expression becomes more and more complicated \rightarrow impossible to handle.

However, if we know that this conditional distribution is actually the same as the conditional distribution only given XnX_n, then the structure will remain simple for any time. This motivates the notion of Markov chain.

4.1.2. Markov Chain

4.1.2.1. Discrete-time Markov Chain

Definition and Examples

Definition: A discrete-time Stochastic process {Xn}n=0,1,...\{X_n\}_{n=0,1,...} is called a discrete-time Markov Chain (DTMC), if its state space SS is discrete, and it has the Markov property:

\begin{aligned} &\quad P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_0=x_0) \\ &= P(X_{n+1}=x_{n+1}|X_n=x_n) \end{aligned}

for all n,x0,...,xn,xn+1Sn, x_0,...,x_n,x_{n+1}\in S

If the conditional distribution Xn+1{Xn=i}X_{n+1}|\{X_n=i\} does not change over time, i.e. P(Xn+1=jXn=i)=P(X1=jX0=i)P(X_{n+1}=j|X_n=i)=P(X_1=j|X_0=i), then we call this Markov chain time-homogeneous (the default setting for this course).

\begin{aligned} &\quad P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_0=x_0 ) \quad\quad &X_{n+1}=x_{n+1}\text{: future; } X_{n}=x_{n}\text{: present (state)} \\ &= P(X_{n+1}=x_{n+1}|X_n=x_n) &X_{n-1}=x_{n-1},...,X_{0}=x_{0}\text{: past (history)} \end{aligned}

Intuition: Given the present state, the past and the future are independent. In other words, the future depends on the previous results only through the current state.

Example: simple random walk

The simple random walk {Sn}n=0,1,...\{S_n\}_{n=0,1,...} is a Markov chain

Proof:

Recall that Sn+1=Sn+Xn+1S_{n+1}=S_n+X_{n+1}

if sn+1=sn±1s_{n+1}\cancel{=} s_n \pm 1

P(Sn+1=sn+1Sn=sn,...,S0=s0)=0=P(Sn+1=sn+1Sn=sn)\begin{aligned} &\quad P(S_{n+1}=s_{n+1}|S_n=s_n,...,S_0=s_0) \\ & = 0 \\ & = P(S_{n+1}=s_{n+1}|S_n=s_n) \end{aligned}

\begin{aligned} &\quad P(S_{n+1}=s_n+1|S_n=s_n,...,S_0=s_0) \\ &=P(X_{n+1}=1|S_n=s_n,...,S_0=s_0) \\ &=P(X_{n+1}=1) \quad\quad\quad X_{n+1} \perp (X_1,...,X_n) \text{ hence also } (S_0, ..., S_n) \end{aligned}

Similarly,

\begin{aligned} &\quad P(S_{n+1}=s_n+1|S_n=s_n) \\ &=P(X_{n+1}=1|S_n=s_n) \\ &=P(X_{n+1}=1) \\ &\Rightarrow P(S_{n+1}=s_n+1|S_n=s_n,...,S_0=s_0) = P(S_{n+1}=s_n+1|S_n=s_n) \end{aligned}

Similarly,

\begin{aligned} &\quad P(S_{n+1}=s_n-1|S_n=s_n,...,S_0=s_0) \\ &=P(S_{n+1}=s_n-1|S_n=s_n) \\ &=P(X_{n+1}=-1) \\ &\Rightarrow \{S_n\}_{n=0,1,...} \text{ is a DTMC} \quad\quad\blacksquare \end{aligned}

4.1.3. One-step transition probability matrix

For a time-homogeneous DTMC, define

Pij=P(X1=jX0=i)=P(Xn+1=jXn=i)n=0,1,...\begin{aligned} P_{ij} &= P(X_1=j|X_0=i) \\ &= P(X_{n+1}=j|X_n=i) \quad\quad n=0,1,... \end{aligned}

PijP_{ij}: one step transition probability

The collection of Pij,i,jSP_{ij}, i,j\in S governs all the one-step transitions of the DTMC. Since it has two indices ii and jj, it naturally forms a matrix P={Pij}i,jSP=\{P_{ij}\}_{i,j\in S}, called the (one-step) transition (probability) matrix or transition matrix

Property of a transition matrix P={Pij}i,jSP=\{P_{ij}\}_{i,j\in S}:

\begin{aligned} &P_{ij}\geq 0\quad \forall i,j\in S \\ &\sum_{j\in S}P_{ij}=1 \quad \forall i\in S \quad\rightarrow\text{ the row sums of } P \text{ are all }1 \end{aligned}

Reason:

\begin{aligned} \sum_{j\in S}P_{ij} &=\sum_{j\in S} P(X_1=j|X_0=i) \\ &= P(X_1\in S|X_0=i) \\ &= 1 \end{aligned}

Example 4.1.3.1. simple random walk

There will be 3 cases:

Pi,i+1=P(S1=i+1S0=i)=P(X1=1)=pPi,i1=P(S1=i1S0=i)=P(X1=1)=1p=:qPi,j=0 for j=i±1\begin{aligned} & P_{i,i+1} = P(S_1=i+1|S_0=i) = P(X_1=1)= p \\ & P_{i,i-1} = P(S_1=i-1|S_0=i) = P(X_1=-1) = 1 - p =:q \\ & P_{i,j}=0 \quad\quad\quad\quad\text{ for } j\cancel{=}i\pm 1 \\ \end{aligned}

\Rightarrow \text{(infinite dimension)}\quad P=\begin{Bmatrix} ... & ... & ... & ... & ... & ... & ...\\ ... & 0 & p & 0 & ... & ... & ...\\ ... & q & 0& p & ... & ... & ...\\ ...& ... & q & 0 & p & ... & ... \\ ... & ... & ... & q & 0 & p & ... \\ ... & ... & ... & ... & ... & ... & ... \end{Bmatrix}

Example 4.1.3.2. Ehrenfest's urn

Two urns A,BA, B, total MM balls. Each time, pick one ball randomly(uniformly), and move it to the opposite urn.

Xn:# of balls in A after step nX_n: \# \text{ of balls in } A \text{ after step }n

S={0,1,...,M}S=\{0,1,...,M\}

\begin{aligned} P_{ij} & =P(X_1=j|X_0=i) \quad\quad\text{($i$ balls in $A$, $M-i$ balls in $B$)}\\ & =\begin{cases} \begin{aligned} & i/M \quad\quad\quad\quad\quad\quad&j=i-1\\ & (M-i)/M &j=i+1 \\ & 0 &j\cancel{=}i\pm 1 \end{aligned} \end{cases} \end{aligned}

P=\begin{Bmatrix} 0 & 1 \\ 1/M & 0 & (M-1)/M \\ &1/M & 0 & (M-1)/M \\ &&2/M & 0 & (M-2)/M \\ ...&...&...&...&...&...&...\\ &&&&(M-1)/M & 0 & 1/M \\ &&&&&1&0 \end{Bmatrix}
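A minimal sketch constructing this matrix (Python, with an illustrative M = 5; exact fractions avoid rounding in the row sums):

```python
from fractions import Fraction

# Build the Ehrenfest transition matrix exactly.
M = 5  # illustrative number of balls
P = [[Fraction(0) for _ in range(M + 1)] for _ in range(M + 1)]
for i in range(M + 1):                    # i = # of balls currently in A
    if i > 0:
        P[i][i - 1] = Fraction(i, M)      # a ball of A is picked, leaves A
    if i < M:
        P[i][i + 1] = Fraction(M - i, M)  # a ball of B is picked, enters A

row_sums = [sum(row) for row in P]        # should all equal 1
```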

Example 4.1.3.3: Gambler's ruin

A gambler at each time wins 1 with probability pp, loses 1 with probability 1p=q1-p=q. Initial wealth S0=aS_0=a; wealth at time nn: SnS_n. The gambler leaves if Sn=0S_n=0 (loses all money) or Sn=M>aS_n=M>a (wins a certain amount of money and gets satisfied)

This is a variant of the simple random walk, where we have absorbing barriers(Pii=1P_{ii}=1) at 00 and MM

S={0,...,M}S=\{0,...,M\}

Pij={pj=i+1,i=1,...,M1qj=i1,i=1,...,M11i=j=0 or i=j=M0otherwiseP_{ij}=\begin{cases} \begin{aligned} & p \quad\quad & j=i+1, i=1,...,M-1 \\ & q & j=i-1, i=1,...,M-1 \\ & 1 & i=j=0 \text{ or } i=j=M \\ & 0 & \text{otherwise} \end{aligned} \end{cases}

P=\begin{Bmatrix} 1 & 0 &...& \\ q & 0 & p & ... \\ ... & q & 0 & p & ... \\ &... & q & 0 & p & ... \\ ...&...&...&...&...&...&...\\ & & & ... & q & 0 & p \\ & & & & ... & 0 & 1 \end{Bmatrix}

Example 4.1.3.4: Bonus-Malus system

Insurance company has 4 premium levels: 1, 2, 3, 4

Let Xn{1,2,3,4}X_n\in\{1,2,3,4\} be the premium level for a customer at year nn

YniidPoi(λ): # of claims at year nY_n \stackrel{iid}{\sim}Poi(\lambda) : \text{ \# of claims at year $n$}

Denote ak=P(Yn=k),k=0,1,...a_k=P(Y_n=k), k=0,1,...

P=\begin{Bmatrix} a_0 & a_1 & a_2 & (1-a_0-a_1-a_2) \\ a_0 & 0 & a_1 & (1-a_0-a_1) \\ 0 & a_0 & 0 & (1-a_0) \\ 0 & 0 & a_0 & (1-a_0) \end{Bmatrix}
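Assuming Poisson claim counts with an illustrative rate lam = 0.5, this matrix can be built from the a_k and its rows checked to sum to 1:

```python
import math

# Bonus-Malus transition matrix from Poisson claim probabilities.
lam = 0.5  # illustrative claim rate
# a_k = P(Y_n = k) for k = 0, 1, 2
a = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(3)]

P = [
    [a[0], a[1], a[2], 1 - a[0] - a[1] - a[2]],
    [a[0], 0.0,  a[1], 1 - a[0] - a[1]],
    [0.0,  a[0], 0.0,  1 - a[0]],
    [0.0,  0.0,  a[0], 1 - a[0]],
]
row_sums = [sum(row) for row in P]
```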

4.2. Chapman-Kolmogorov equations

Q: Given the (one-step) transition matrix, P={Pij}i,jSP=\{P_{ij}\}_{i,j\in S}, how can we decide the n-step transition probability

Pij(n):=P(Xn=jX0=i)=P(Xn+m=jXm=i),m=0,1,...\begin{aligned} P_{ij}^{(n)} &:= P(X_n=j|X_0=i) \\ &=P(X_{n+m}=j|X_m=i), \quad m=0,1,... \end{aligned}

As a special case, let us start with Pij(2)P_{ij}^{(2)} and their collection P(2)={Pij(2)}P^{(2)}=\{P_{ij}^{(2)}\}_{i,j\in S} (also a square matrix, same dimension as PP)

Condition on what happens at time 11:

\begin{aligned} P_{ij}^{(2)} &= P(X_2=j|X_0=i) \\ &= \sum_{k\in S} P(X_2=j|X_0=i, X_1=k) \cdot P(X_1=k | X_0=i) \quad \text{conditional law of total probability} \end{aligned}

4.2.1. Conditional Law of total probability

P(X2=jX0=i)=kSP(X2=j,X1=kX0=i)=kSP(X2=j,X1=k,X0=i)P(X0=i)=kSP(X2=j,X1=k,X0=i)P(X1=k,X0=i)P(X1=k,X0=i)P(X0=i)=kSP(X2=jX0=i,X1=k)P(X1=kX0=i)\begin{aligned} &\quad P(X_2=j|X_0=i) \\ &=\sum_{k\in S} P(X_2=j, X_1=k | X_0=i) \\ &=\sum_{k\in S} \frac{P(X_2=j, X_1=k, X_0=i)}{P(X_0=i)} \\ &=\sum_{k\in S} \frac{P(X_2=j, X_1=k, X_0=i)}{P(X_1=k, X_0=i)} \cdot\frac{P(X_1=k, X_0=i)}{P(X_0=i)} \\ &=\sum_{k\in S} P(X_2=j|X_0=i,X_1=k)\cdot P(X_1=k|X_0=i) \end{aligned}

Continuing with Pij(2)P_{ij}^{(2)}:

\begin{aligned} P_{ij}^{(2)} &= P(X_2=j|X_0=i) \\ &= \sum_{k\in S} P(X_2=j|X_0=i, X_1=k) \cdot P(X_1=k | X_0=i) \quad \text{conditional law of total probability} \\ &=\sum_{k\in S} P(X_2=j|X_1=k)\cdot P(X_1=k|X_0=i) \\ &=\sum_{k\in S} P(X_1=j|X_0=k)\cdot P(X_1=k|X_0=i) \\ &=\sum_{k\in S} P_{ik}\cdot P_{kj} \\ &= (P \cdot P)_{ij} \end{aligned}

Thus, P(2)=PP=P2P^{(2)} = P\cdot P=P^2
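This identity can be checked numerically. The sketch below (Python) uses a generic two-state chain with illustrative entries, computes the entry-wise Chapman-Kolmogorov sum, and compares one entry against expanding the two sample paths 0→0→0 and 0→1→0 by hand.

```python
# Verify P^(2) = P·P on a two-state chain (alpha, beta illustrative).
alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]

n = len(P)
# entry-wise sum: P^{(2)}_{ij} = sum_k P_{ik} P_{kj}
P2 = [[sum(P[i][k] * P[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

# hand expansion of the two paths 0 -> 0 -> 0 and 0 -> 1 -> 0
P2_00_by_hand = (1 - alpha) ** 2 + alpha * beta
```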

Using the same idea, for n,m=0,1,2,3...n,m=0,1,2,3...:

\begin{aligned} P_{ij}^{(n+m)} &=P(X_{n+m}=j|X_0=i) \\ &= \sum_{k\in S}P(X_{n+m}=j|X_0=i, X_m=k) \cdot P(X_m=k|X_0=i) \\ &= \sum_{k\in S}P(X_{n+m}=j|X_m=k) \cdot P(X_m=k|X_0=i) \quad \text{Markov property} \\ &= \sum_{k\in S}P(X_{n}=j|X_0=k) \cdot P(X_m=k|X_0=i) \\ &=\sum_{k\in S} P_{ik}^{(m)}\cdot P_{kj}^{(n)} \\ &=(P^{(m)}\cdot P^{(n)})_{ij} \\ &\Rightarrow P^{(n+m)} = P^{(m)}\cdot P^{(n)} \quad\quad (*) \end{aligned}

By definition, P(1)=PP^{(1)}=P

Note:

()(*) is called the Chapman-Kolmogorov equations (c-k equation). Entry-wise:

P_{ij}^{(n+m)}=\sum_{k\in S}P_{ik}^{(m)}P_{kj}^{(n)}

Intuition:

[Figure: illustration of the C-K equation]

"Condition at time mm (on XmX_m) and sum over all the possibilities"

4.2.2. Distribution of XnX_n

So far, we have seen transition probability Pij(n)=P(Xn=jX0=i)P_{ij}^{(n)}=P(X_n=j|X_0=i). This is not the probability P(Xn=j)P(X_n=j). In order to get this distribution, we need the information about which state the Markov chain starts with.

Let α0,i=P(X0=i)\alpha_{0,i}=P(X_0=i). The row vector α0=(α0,0,α0,1,...)\alpha_0=(\alpha_{0,0},\alpha_{0,1},...) is called the initial distribution of the Markov chain. This is the distribution of the initial state X0X_0

Similarly, we define distribution of XnX_n: αn=(αn,0,αn,1,...)\alpha_n=(\alpha_{n,0},\alpha_{n,1},...) where αn,i=P(Xn=i)\alpha_{n,i}=P(X_n=i)

Fact: \alpha_n=\alpha_0\cdot P^n

Proof:

jSαn,j=P(Xn=j)=iSP(Xn=jX0=i)P(X0=i)=iSα0,iPij(n)=(α0P(n))j=(α0Pn)j\forall j \in S \\\\ \begin{aligned} \alpha_{n,j} &=P(X_n=j) \\ &= \sum_{i\in S} P(X_n=j|X_0=i)\cdot P(X_0=i) \\ &= \sum_{i\in S} \alpha_{0,i}\cdot P_{ij}^{(n)} \\ &=(\alpha_0\cdot P^{(n)})_j = (\alpha_0\cdot P^n)_j \\ \end{aligned}

αn=α0Pn\Rightarrow \alpha_n =\alpha_0\cdot P^n
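A short sketch of this propagation (Python, on an illustrative two-state chain started in state 0; after many steps the distribution settles, previewing the stationary distribution of the next section):

```python
# Propagate a distribution through n steps: alpha_n = alpha_0 · P^n.
P = [[0.9, 0.1],
     [0.4, 0.6]]
alpha = [1.0, 0.0]   # alpha_0: start in state 0

def step(dist, P):
    # one application of dist -> dist · P
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P))]

for _ in range(50):
    alpha = step(alpha, P)
# alpha now approximates (0.8, 0.2), which solves pi = pi · P
```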

Remark: The distribution of a DTMC is completely determined by two things: the initial distribution α0\alpha_0 and the transition matrix PP

4.3. Stationary distribution (invariant distribution)

Definition: A probability distribution π=(π0,π1,...)\pi = (\pi_0,\pi_1,...) is called a stationary distribution(invariant distribution) of the DTMC {Xn}n=0,1,...\{X_n\}_{n=0,1,...} with transition matrix PP, if :

  1. \underline{\pi}=\underline{\pi}\cdot P
  2. \sum_{i\in S}\pi_i = 1 (\Leftrightarrow \underline{\pi}\cdot\mathbb{1}=1). (\mathbb{1}: a column vector of all 1's)

Why such π\underline{\pi} is called stationary/invariant distribution?

iSπi=1,πi0,i=0,1,...distributionπ=πPinvariant/stationary. \sum_{i\in S} \pi_i = 1, \pi_i \geq 0, i=0,1,... \Rightarrow \text{distribution} \\ \underline{\pi} = \pi\cdot P \Rightarrow \text{invariant/stationary.}

Assume the MC starts from the initial distribution α0=π\alpha_0=\underline{\pi}. Then the distribution of X1X_1 is

α1=α0P=πP=π=α0 \alpha_1 =\alpha_0 \cdot P =\underline{\pi}\cdot P = \underline{\pi} = \alpha_0

The distribution of X2X_2:

α2=α0P2=πPP=πP=π=α0 \alpha_2=\alpha_0\cdot P^2 =\underline{\pi}\cdot P\cdot P = \underline{\pi} \cdot P = \underline{\pi} = \alpha_0

\cdots\cdots

αn=α0\alpha_n=\alpha_0

Thus, if the MC starts from a stationary distribution, then its distribution will not change over time.

Example 4.3.1

An electron has two states: ground(0),excited(1)ground(0), excited(1). Let Xn{0,1}X_n\in\{0,1\} be the state at time nn.

At each step, the electron changes state with probability α\alpha (from 0 to 1) or β\beta (from 1 to 0).

Then {Xn}\{X_n\} is a DTMC. Its transitional matrix is:

P={1ααβ1β}P=\begin{Bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{Bmatrix}

Now let us solve for the stationary distribution π=πP\underline{\pi} =\underline{\pi}\cdot P.

(π0,π1)(1ααβ1β)=(π0,π1)(\pi_0,\pi_1) \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} =(\pi_0,\pi_1)

{π0(1α)+π1β=π0(1)π0α+π1(1β)=π1(2)\Rightarrow \begin{cases} \pi_0(1-\alpha) + \pi_1\beta=\pi_0\quad(1) \\ \pi_0\alpha + \pi_1(1-\beta) = \pi_1\quad(2) \end{cases}

We have two equations and two unknowns. However, note that they are not linearly independent:

sum of LHS =π0+π1==\pi_0+\pi_1= sum of RHS. Hence (2)(2) can be derived from (1)(1). By (1)(1), we have:

απ0=βπ1orπ0π1=βα\alpha\pi_0=\beta\pi_1\quad\text{or}\quad\frac{\pi_0}{\pi_1}=\frac{\beta}{\alpha}

This is where we need \underline{\pi}\cdot\mathbb{1}=1:

π0+π1=1π0=βα+β,π1=αα+β\pi_0+\pi_1=1 \Rightarrow\pi_0=\frac{\beta}{\alpha+\beta},\quad \pi_1 =\frac{\alpha}{\alpha+\beta}

Thus, we conclude that there exists a unique stationary distribution (βα+β,αα+β)=π(\frac{\beta}{\alpha+\beta},\frac{\alpha}{\alpha+\beta})=\underline{\pi}

The above procedure for solving for stationary distribution is typical:

  1. Use \underline{\pi}=\underline{\pi}P to get the relations between the different components of \underline{\pi}
  2. Use \underline{\pi}\cdot \mathbb{1} = 1 to normalize (get exact values)
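The recipe can be checked numerically on the electron chain; alpha = 0.3, beta = 0.2 are illustrative values.

```python
# Verify that pi = (beta/(alpha+beta), alpha/(alpha+beta)) is stationary.
alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]

pi = [beta / (alpha + beta), alpha / (alpha + beta)]   # claimed solution

# step 1: pi · P should reproduce pi; step 2: pi should sum to 1
piP = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
```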

4.4. Classification of States

4.4.1. Transience and Recurrence

Let TiT_i be the waiting time for the MC to visit/revisit state ii for the first time:

Ti:=min{n>0:Xn=i}Ti is a r.v.T_i:=min\{n>0:X_n=i\}\quad\quad T_i \text{ is a r.v.}

Ti=T_i=\infty if the MC never (re)visits state ii.

Definition 4.4.1. Transience and Recurrence

A state ii is called:

  1. recurrent, if P(T_i<\infty|X_0=i)=1
  2. transient, if P(T_i<\infty|X_0=i)<1
  3. positive recurrent, if it is recurrent and \mathbb{E}(T_i|X_0=i)<\infty
  4. null recurrent, if it is recurrent and \mathbb{E}(T_i|X_0=i)=\infty

Example 4.4.1

P=(121212121)P=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \\ & \frac{1}{2} & \frac{1}{2} \\ \\ & & 1 \end{pmatrix}

Given X0=0X_0=0,

P(X1=0T0=1X0=0)=P(X1=1T0= since state 1and 2 do not go to 0X0=0)=12P(T0<X0=0)=12<1P(\underbrace{X_1=0}_{T_0=1}|X_0=0)=P(\underbrace{X_1=1}_{T_0=\infty \text{ since state 1}\atop\text{and 2 do not go to 0}}|X_0=0)=\frac{1}{2} \quad\Rightarrow\quad P(T_0<\infty|X_0=0)=\frac{1}{2}<1

Thus, state 00 is transient

Similarly, state 11 is transient.

Given X0=2X_0=2,

P(X_1=2|X_0=2)=1 \Rightarrow P(T_2<\infty|X_0=2) = 1

Since E(T_2|X_0=2)=1<\infty, state 2 is positive recurrent.

In general, the distribution of TiT_i is very hard to determine \Rightarrow need better criteria for recurrence/transience.

Criteria (1): Define fii=P(Ti<X0=i)f_{ii}=P(T_i<\infty|X_0=i), and

V_i= \text{\# of times that the MC (re)visits state } i=\sum_{n=1}^\infty \mathbb{I}_{\{X_n=i\}}

If state ii is transient

P(V_i=k|X_0=i)=\underbrace{f_{ii}^k}_{\text{goes back to}\atop\text{$i$ for $k$ times}}(\underbrace{1-f_{ii}}_{\text{never visits}\atop\text{$i$ again}})

Vi+1Geo(1fii)\Rightarrow V_i+1\sim Geo(1-f_{ii})

In particular, P(V_i<\infty|X_0=i)=1 \Rightarrow if state ii is transient, it is visited only finitely many times with probability 1. The MC will leave state ii forever sooner or later.

On the other hand, if state ii is recurrent, then fii=1f_{ii}=1

P(V_i=k|X_0=i)=0\quad k=0,1,... \Rightarrow P(V_i=\infty|X_0=i)=1

If the MC starts at a recurrent state ii, it will visit that state infinitely many times.

Criteria (2): In terms of E(ViX0=i)E(V_i|X_0=i):

E(ViX0=i)=11fii1=fii1fii<if fii<1,(i transient)E(ViX0=i)=,if fii=1,(i recurrent)\begin{aligned} &E(V_i|X_0=i)=\frac{1}{1-f_{ii}} - 1 = \frac{f_{ii}}{1-f_{ii}} < \infty &\text{if } f_{ii} < 1, (i \text{ transient}) \\ &E(V_i|X_0=i)=\infty, &\text{if } f_{ii}=1, (i \text{ recurrent}) \end{aligned}

Criteria (3): Note that

\begin{aligned} E(V_i|X_0=i) &= E(\sum_{n=1}^\infty \mathbb{I}_{\{X_n=i\}}|X_0=i) \\ &= \sum_{n=1}^\infty E( \mathbb{I}_{\{X_n=i\}}|X_0=i) \\ &= \sum_{n=1}^\infty P(X_n=i|X_0=i) \\ &=\sum_{n=1}^\infty P_{ii}^{(n)} \end{aligned}

n=1Pii(n)<if i transientn=1Pii(n)=if i recurrent\begin{aligned} \Rightarrow &\sum_{n=1}^\infty P_{ii}^{(n)} < \infty &\text{if } i \text{ transient} \\ \Rightarrow &\sum_{n=1}^\infty P_{ii}^{(n)} = \infty &\text{if } i \text{ recurrent} \\ \end{aligned}

To conclude,

\begin{aligned} \text{state } i\quad\quad & \text{recurrent} & \text{transient} \\ &P(T_i<\infty |X_0=i) = 1 \quad& P(T_i<\infty|X_0=i)<1 \\ &P(V_i=\infty|X_0=i)=1&P(V_i<\infty|X_0=i)=1 \\ &E(V_i|X_0=i)=\infty &E(V_i|X_0=i)<\infty \\ \text{easiest to use: } &\sum_{n=1}^\infty P_{ii}^{(n)}=\infty &\sum_{n=1}^\infty P_{ii}^{(n)}<\infty \end{aligned}

4.4.2. Periodicity

Example:

P=(1121212121)P= \begin{pmatrix} & 1 & & \\ \frac{1}{2} & & \frac{1}{2} & \\ &\frac{1}{2}&&\frac{1}{2}\\ &&1& \end{pmatrix}

Note that if we start from 0, we can only get back to 0 in 2,4,6,2, 4, 6,\cdots, i.e., an even number of steps: P00(2n+1)=0,nP_{00}^{(2n+1)}=0,\quad \forall n

Definition 4.4.2. Period

The period of state ii is defined as

d_i=\underbrace{\gcd}_{\text{greatest}\atop\text{common divisor}}(\{n:\underbrace{P_{ii}^{(n)}>0}_{\text{can go back} \atop\text{to $i$ in $n$ steps} }\})

In this example above, d0=gcd({even numbers})=2d_0=gcd(\{\text{even numbers}\}) = 2
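This gcd can be computed directly from the matrix powers. The sketch below (Python) does so for the 4-state chain above, scanning the first 20 powers, which is enough for this small example.

```python
from math import gcd

# d_0 = gcd{n : P_00^(n) > 0} for the chain of this example.
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

d0 = 0
Pn = P                      # Pn = P^n inside the loop
for n in range(1, 21):
    if Pn[0][0] > 0:        # state 0 can return to itself in n steps
        d0 = gcd(d0, n)
    Pn = matmul(Pn, P)
```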

If di=1d_i=1, state ii is called "aperiodic"

If n>0\cancel{\exists} n > 0 such that Pii(n)>0P_{ii}^{(n)}>0, then di=d_i=\infty

Remark 4.4.2

Note that Pii>0di=1P_{ii} > 0 \Rightarrow d_i = 1. The converse is not true.

P00(2)>0,P00(3)>0d0=1 but P00=0P_{00}^{(2)} >0, P_{00}^{(3)} > 0 \Rightarrow d_0 =1 \text{ but } P_{00}=0

In general, di=dPii(d)>0d_i=d \cancel{\Rightarrow} P_{ii}^{(d)}>0

4.4.3. Equivalent classes and irreducibility

Definition 4.4.3.1. Accessible

Let \{X_n\}_{n=0,1,\cdots} be a DTMC with state space SS. State jj is said to be accessible\text{\underline{accessible}} from state ii, denoted by iji\rightarrow j, if Pij(n)>0P_{ij}^{(n)}>0 for some n0n\geq 0.

Intuitively, ii can go to state jj in finite steps.

Definition 4.4.3.2. Communicate

If iji\rightarrow j and jij\rightarrow i, we say ii and jj communicate, denoted by iji\leftrightarrow j.

Fact 4.4.3.1

"Communication" is an equivalence relation.

  1. iii\leftrightarrow i, since Pii(0)=P(X0=iX0=i)=1P_{ii}^{(0)}=P(X_0=i|X_0=i)=1 (reflexivity)
  2. iji\leftrightarrow j then jij\leftrightarrow i (symmetry)
  3. ij,jki\leftrightarrow j, j\leftrightarrow k, then iki\leftrightarrow k (transitivity)

Definition 4.4.3.3. Class

As a result, we can use "\leftrightarrow" to divide the state space into different classes, each containing only the states which communicate with each other.

\begin{cases} S=\bigcup_kC_k \quad\quad\quad\quad \text{($\{C_k\}$ is a partition of $S$)}\\ C_k\bigcap C_{k'} = \emptyset, \ k\cancel{=}k' \end{cases}

Definition 4.4.3.4. Irreducible

A MC is called irreducible, if it has only one class. In other words, iji\leftrightarrow j for any i,jSi, j\in S


-Q: How to find equivalent classes?

-A: "Draw a graph and find the loops"

Example 4.4.3.1. Find the classes

P=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} & & \\ \\ \frac{1}{2}&\frac{1}{2}&&\\ \\ \frac{1}{4}&\frac{1}{4}&\frac{1}{4}&\frac{1}{4}\\ \\ &&&1 \end{pmatrix}

Draw an arrow from ii to jj if Pij>0P_{ij} > 0

\Rightarrow 3 classes: {0,1},{2},{3}\{0,1\}, \{2\}, \{3\}
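The same classes can be recovered mechanically. The sketch below (Python) computes reachability with Warshall's transitive closure and intersects both directions; the missing entry in row 2 of the matrix is taken to be 1/4 so the row sums to 1.

```python
# Communicating classes of Example 4.4.3.1 via reachability.
P = [
    [0.5,  0.5,  0.0,  0.0],
    [0.5,  0.5,  0.0,  0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.0,  0.0,  0.0,  1.0],
]
n = 4

# reach[i][j] = True iff i -> j (n >= 0 steps, so i -> i always)
reach = [[P[i][j] > 0 or i == j for j in range(n)] for i in range(n)]
for k in range(n):          # Warshall transitive closure
    for i in range(n):
        for j in range(n):
            reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

classes, seen = [], set()
for i in range(n):          # i ~ j iff i -> j and j -> i
    if i not in seen:
        cls = {j for j in range(n) if reach[i][j] and reach[j][i]}
        classes.append(cls)
        seen |= cls
```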

Example 4.4.3.2. Find the classes

P=(1212121212121212) P=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} & & \\ \\ \frac{1}{2}&&\frac{1}{2}&\\ \\ \frac{1}{2}&&&\frac{1}{2}\\ \\ &&\frac{1}{2}&\frac{1}{2} \end{pmatrix}

\Rightarrow This MC is irreducible

Fact 4.4.3.2

Proposition: Transience/recurrence is a class property. That is, if iji\leftrightarrow j, then jj is transient/recurrent if and only if ii is transient/recurrent

Proof:

Suppose ii is recurrent, then k=1Pii(k)=\sum_{k=1}^\infty P_{ii}^{(k)}=\infty

Since iji \leftrightarrow j, m,n\exists m,n such that Pij(m)>0,Pji(n)>0P_{ij}^{(m)}>0, P_{ji}^{(n)}>0

Note that

\begin{aligned} \underbrace{P_{jj}^{(n+k+m)}}_{P(X_{n+k+m}=j|X_0=j)} \geq \underbrace{P_{ji}^{(n)}P_{ii}^{(k)}P_{ij}^{(m)}}_{P(X_{n+k+m}=j,\ X_{n+k}=i,\ X_n=i|X_0=j)} \Rightarrow \sum_{l=1}^\infty P_{jj}^{(l)} &\geq \sum_{l=n+m+1}^\infty P_{jj}^{(l)} \\ & = \sum_{k=1}^\infty P_{jj}^{(n+k+m)} \\ & \geq \sum_{k=1}^\infty P_{ji}^{(n)}P_{ii}^{(k)}P_{ij}^{(m)} \\ & = \underbrace{P_{ji}^{(n)}}_{>0}\underbrace{P_{ij}^{(m)}}_{>0}\ \underbrace{\sum_{k=1}^\infty P_{ii}^{(k)}}_{=\infty} = \infty \end{aligned}

Thus, jj is recurrent. Symmetrically, jj is recurrent \Rightarrow ii is recurrent

Thus,

For irreducible MC, since recurrence and transience are class properties, we also say the Markov Chain is recurrent/transient

Proposition 4.4.3.1

If an irreducible MC has a finite state space, then it is recurrent

Idea of proof

If the MC is transient, then with probability 1, each state has a last visit time. Finite states \Rightarrow \exists a last visit time for all the states. As a result, the MC has nowhere to go after that time. \Rightarrow Contradiction.

Remark 4.4.3.1

We can actually prove that the MC must be positive recurrent, if the state space is finite and the MC is irreducible.

Theorem 4.4.3.1

Periodicity is a class property: ijdi=dji\leftrightarrow j\Rightarrow d_i=d_j.

For an irreducible MC, its period is defined as the period of any state.

4.5. Limiting Distribution

In this part, we are interested in \lim_{n\rightarrow\infty} P_{ij}^{(n)} and \lim_{n\rightarrow\infty}P(X_n=i)

To make things simple, we focus on the irreducible case.

Theorem 4.5.1. Basic Limit Theorem

Let {Xn}n=0,1,\{X_n\}_{n=0,1,\ldots} be an irreducible, aperiodic, positive recurrent DTMC. Then a unique stationary distribution:

\underline{\pi}=(\pi_0,\pi_1,\ldots) \text{ exists}

Moreover:

 (*)\underbrace{\lim_{n\rightarrow\infty} P_{ij}^{(n)}}_{\text{limiting distribution}\atop\text{(does not depend on the initial state $i$)} } =\lim_{n\rightarrow\infty} \underbrace{\frac{\sum_{k=1}^n\mathbb{I}_{\{X_k=j\}}}{n}}_{\text{long-run fraction of time spent in $j$}} =\underbrace{\frac{1}{\mathbb{E}(T_j|X_0=j)}}_{\text{$T_j=\min\{n>0:X_n=j\}$}\atop\text{expected revisit time}} =\pi_j,\quad i,j\in S

Limiting distribution = long-run fraction of time spent in the state = 1/(expected revisit time) = stationary distribution

Remark 4.5.1

The result ()(*) is still true if the MC is null recurrent, where all the terms are 0, and π\underline{\pi} is no longer a distribution. (in other words, there does not exist a stationary distribution)

Remark 4.5.2

If {Xn}n=0,1,\{X_n\}_{n=0,1,\ldots} has a period d>1d>1:

 \frac{\lim_{n\rightarrow\infty} P_{jj}^{(nd)}}{d} = \lim_{n\rightarrow\infty}\frac{\sum_{k=1}^n\mathbb{I}_{\{X_k=j\}}}{n} = \quad\frac{1}{\mathbb{E}(T_j|X_0=j)}=\pi_j

Back to the aperiodic case. Since the limit \lim_{n\rightarrow\infty}P_{ij}^{(n)} does not depend on ii, \lim_{n\rightarrow\infty}P_{ij}^{(n)}=\pi_j is also the limiting (marginal) distribution of state jj:

limnαn,j=limnP(Xn=j)=πj\lim_{n\rightarrow\infty}\alpha_{n,j} = \lim_{n\rightarrow\infty}P(X_n=j)=\pi_j

regardless of the initial distribution α0\alpha_0

Detail:

\begin{aligned} \lim_{n\rightarrow\infty}\alpha_{n,j} & = \lim_{n\rightarrow\infty}(\alpha_0\cdot P^{(n)})_j \\ & = \lim_{n\rightarrow\infty}\sum_{i\in S}\alpha_{0,i}\cdot P_{ij}^{(n)} \\ & = \sum_{i\in S}\alpha_{0,i}\cdot\lim_{n\rightarrow\infty} P_{ij}^{(n)} \\ & = (\sum_{i\in S}\alpha_{0,i})\pi_j \\ & = \pi_j \end{aligned}
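The convergence of the rows of P^n to pi can be observed numerically. The sketch below (Python) uses the electron chain with illustrative alpha = 0.3, beta = 0.2 and raises P to the 100th power.

```python
# Check the Basic Limit Theorem: every row of P^n approaches pi.
alpha, beta = 0.3, 0.2
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pn = P
for _ in range(99):     # Pn = P^100
    Pn = matmul(Pn, P)
# both rows of Pn are now essentially equal to pi
```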

Why are the conditions in the Basic Limit Theorem necessary?

Example 4.5.1

Consider a MC with

p=(1212121212121212) p=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} & & \\ \frac{1}{2} & \frac{1}{2} & & \\ & & \frac{1}{2} & \frac{1}{2} \\ & & \frac{1}{2} & \frac{1}{2} \end{pmatrix}

Two classes: {0,1},{2,3}\{0,1\}, \{2, 3\} \Rightarrow it is not irreducible. All the states are still aperiodic, positive recurrent

This MC can be decomposed into two MC's:

State 0,10, 1, with

 P_1=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\\\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} \quad\quad \text{irreducible}

State 2,32, 3, with

 P_2=\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\\\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} \quad\quad \text{irreducible}

And

 P=\begin{pmatrix} P_1 & \\ & P_2 \end{pmatrix}

Note that both (12,12,0,0)(\frac{1}{2},\frac{1}{2},0,0) and (0,0,12,12)(0,0,\frac{1}{2},\frac{1}{2}) are stationary distributions. Consequently, any convex combination of these two distributions, of the form:

a(\frac{1}{2},\frac{1}{2},0,0) + (1-a)(0,0,\frac{1}{2},\frac{1}{2}),\quad a\in[0,1]

is also a stationary distribution

Thus, irreducibility is related to the uniqueness of the stationary distribution.

Correspondingly, the limiting transition probability will depend on ii:

limnP00(n)=(limnP1n)00=12\lim_{n\rightarrow\infty}P_{00}^{(n)} = (\lim_{n\rightarrow\infty}P_1^n)_{00}=\frac{1}{2}

but \lim_{n\rightarrow\infty}P_{20}^{(n)}=0
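This dependence on the starting state can be checked numerically. A small sketch (the chain is the one from Example 4.5.1; the power 50 stands in for the limit):

```python
import numpy as np

# Numeric sketch of Example 4.5.1: for this reducible chain, the limit of
# P_ij^(n) depends on the starting state i.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])

Pn = np.linalg.matrix_power(P, 50)  # here P^2 = P, so this is already the limit
print(Pn[0, 0])  # 0.5: starting inside the class {0, 1}
print(Pn[2, 0])  # 0.0: state 0 is unreachable from the class {2, 3}
```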

Example 4.5.2

Consider a MC with

P = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}

Irreducible, positive recurrent, but not aperiodic: d=2

Note that P^2=\begin{pmatrix}1&0\\0&1\end{pmatrix}=I\Rightarrow P^{2n}=I,\quad P^{2n+1}=P=\begin{pmatrix}0&1\\1&0\end{pmatrix}

P_{00}^{(n)}=1 for n even, 0 for n odd \Rightarrow \lim_{n\rightarrow\infty}P_{00}^{(n)} does not exist.

Aperiodicity is related to the existence of the limit \lim_{n\rightarrow\infty}P_{ij}^{(n)}
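A quick numeric sketch of this oscillation (matrix from Example 4.5.2):

```python
import numpy as np

# P has period 2: P^(2n) = I while P^(2n+1) = P, so P_00^(n) oscillates
# between 1 and 0 and the limit of P_00^(n) does not exist.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

even = np.linalg.matrix_power(P, 10)[0, 0]
odd = np.linalg.matrix_power(P, 11)[0, 0]
print(even, odd)  # 1.0 0.0
```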

Example 4.5.3

P_{0,j} = p_j,\quad j=0,1,\cdots,\quad p_0>0 \\ P_{i,i-1}=1,\quad i\geq 1

Given X_0=0, T_0=n+1 if and only if X_1=n. This happens with probability p_n.

\begin{aligned} \Rightarrow\mathbb{E}(T_0|X_0=0) &=\sum_{n=0}^\infty (n+1)p_n \\ &=1 + \sum_{n=0}^\infty np_n \end{aligned}

We can construct p_n such that \sum_{n=0}^\infty np_n=\infty. (For example, p_0=\frac{1}{2}, p_2=\frac{1}{4}, p_4=\frac{1}{8}, p_8=\frac{1}{16},\cdots, so that each term 2^k\cdot p_{2^k}=\frac{1}{2}.)
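A numeric sketch of one such heavy-tailed choice (the tail p_0=\frac{1}{2}, p_{2^k}=2^{-(k+1)} for k\geq 1 is one valid choice, not the only one):

```python
# With p_0 = 1/2 and p_{2^k} = 2^-(k+1) for k >= 1, the probabilities sum
# to 1, while every tail term contributes 2^k * p_{2^k} = 1/2, so the
# partial sums of sum n*p_n grow without bound (infinite mean).
K = 40
total_prob = 0.5 + sum(2.0 ** -(k + 1) for k in range(1, K))
mean_partial = sum((2 ** k) * 2.0 ** -(k + 1) for k in range(1, K))
print(total_prob)    # close to 1 (exactly 1 as K -> infinity)
print(mean_partial)  # (K - 1) / 2, diverging as K grows
```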

In this case, the chain is null recurrent. It is irreducible and aperiodic (P_{00}=p_0>0).

A stationary distribution does not exist. Reason:

P= \begin{pmatrix} p_0 & p_1 & p_2 & \cdots & p_i & \cdots \\ 1 & 0 & & & & \\ & 1 & 0 & & & \\ & & \ddots & \ddots & & \end{pmatrix} \quad\quad \underline{\pi}\cdot P=\underline{\pi} \Rightarrow

\begin{aligned} &p_0\pi_0 + \pi_1 = \pi_0 \\ &p_1\pi_0 + \pi_2 = \pi_1 \\ &\quad\quad\vdots\\ &p_{i-1}\pi_0 + \pi_i = \pi_{i-1} \\ &p_i\pi_0 + \pi_{i+1} = \pi_i \\ \end{aligned}

Add the first ii equations:

(p_0+\cdots+p_{i-1})\pi_0 + (\pi_1+\pi_2+\cdots+\pi_{i-1}+\pi_i) = \pi_0 + \pi_1 + \cdots + \pi_{i-1}\\ \Rightarrow (p_0+\cdots+p_{i-1})\pi_0 + \pi_i = \pi_0\\ \Rightarrow \pi_i=(1-(p_0+\cdots+p_{i-1}))\pi_0 = \sum_{k=i}^\infty p_k\pi_0

Try to normalize:

\begin{aligned} 1 &= \sum_{i=0}^\infty \pi_i \\ &= \sum_{i=0}^\infty\sum_{k=i}^\infty p_k\pi_0 \\ &= \sum_{k=0}^\infty \sum_{i=0}^{k} p_k\pi_0 \\ &= \sum_{k=0}^\infty (k+1)p_k\pi_0 \\ &= (\underbrace{\sum_{k=0}^\infty (k+1)p_k }_{=\infty})\pi_0 \end{aligned} \\ \Rightarrow \pi_0=0 \quad\Rightarrow\quad \pi_i=0 \quad\forall i

This is not a distribution. Thus, a stationary distribution does not exist.

Positive recurrence is related to the existence of the stationary distribution.

Example 4.5.4. Electron

P=\begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} \quad \alpha,\beta\in(0,1)

Irreducible, aperiodic, positive recurrent.

In order to find P^n, we use the diagonalization technique.

P=Q\Lambda Q^{-1} \quad\text{where $\Lambda$ is diagonal}\\\\ \Lambda = \begin{pmatrix} 1 & 0 \\ 0 & 1-\alpha-\beta \end{pmatrix} \quad Q=\begin{pmatrix} 1 & \alpha \\ 1 & -\beta \end{pmatrix} \quad Q^{-1}=\frac{1}{\alpha+\beta}\begin{pmatrix} \beta & \alpha\\ 1 & -1 \end{pmatrix}

Then

\begin{aligned} P^n &= (Q\Lambda Q^{-1})(Q\Lambda Q^{-1})\cdots(Q\Lambda Q^{-1}) \\ &= Q\Lambda^nQ^{-1} \\ &= \begin{pmatrix} 1 & \alpha \\ 1 & -\beta \end{pmatrix} \begin{pmatrix} 1 & \\ & (1-\alpha-\beta)^n \end{pmatrix} \frac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ 1 & -1 \end{pmatrix} \\ &=\frac{1}{\alpha+\beta} \begin{pmatrix} \beta+\alpha(1-\alpha-\beta)^n & \alpha-\alpha(1-\alpha-\beta)^n \\ \beta-\beta(1-\alpha-\beta)^n & \alpha+\beta(1-\alpha-\beta)^n \end{pmatrix}\\ &\Rightarrow \lim_{n\rightarrow\infty}P^n=\frac{1}{\alpha+\beta}\begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} =\begin{pmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{pmatrix} \end{aligned}

Note that \lim_{n\rightarrow\infty} P^n has identical rows. This corresponds to the result that \lim_{n\rightarrow\infty}P_{ij}^{(n)} does not depend on i.

We saw that the stationary distribution is \underline{\pi}=(\frac{\beta}{\alpha+\beta},\frac{\alpha}{\alpha+\beta}). So we verify that \pi_j=\lim_{n\rightarrow\infty} P_{ij}^{(n)}.
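A numeric sketch of this convergence (with hypothetical values \alpha=0.3, \beta=0.6):

```python
import numpy as np

# Example 4.5.4 numerically: P^n converges to a matrix whose rows both
# equal the stationary distribution (beta, alpha) / (alpha + beta).
alpha, beta = 0.3, 0.6
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

Pn = np.linalg.matrix_power(P, 100)  # (1 - alpha - beta)^100 is negligible
pi = np.array([beta, alpha]) / (alpha + beta)
print(Pn)  # both rows close to pi
print(pi)
```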

Also, given X_0=0, \mathbb{P}(T_0=1|X_0=0)=1-\alpha.

For k=2,3,\cdots

\begin{aligned} \mathbb{P}(T_0=k|X_0=0) &= \mathbb{P}(X_k=0,X_{k-1}=1,\cdots, X_1=1|X_0=0) \\ &=\alpha(1-\beta)^{k-2}\beta \end{aligned}\\ \begin{aligned} \Rightarrow \mathbb{E}(T_0|X_0=0) &=1\cdot(1-\alpha)+\sum_{k=2}^\infty k\cdot\alpha(1-\beta)^{k-2}\beta \\ &=1-\alpha+\alpha\underbrace{\sum_{k=2}^\infty(k-1)(1-\beta)^{k-2}\beta}_{=\mathbb{E}(Geo(\beta))=\frac{1}{\beta}}+\alpha\underbrace{\sum_{k=2}^\infty(1-\beta)^{k-2}\beta}_{\text{pmf of Geo($\beta$), sums to }1}\\ &=1-\alpha+\frac{\alpha}{\beta}+\alpha\\ &=1+\frac{\alpha}{\beta}\\ &=\frac{\alpha+\beta}{\beta} \end{aligned}

Hence we verify that \mathbb{E}(T_0|X_0=0)=\frac{1}{\pi_0}

4.6. Generating function and branching processes

Definition 4.6.1

Let \underline{p}=(p_0,p_1,\cdots) be a distribution on \{0,1,2,\cdots\}. Let \xi be a r.v. following distribution \underline{p}; that is, \mathbb{P}(\xi=i)=p_i. Then the generating function of \xi, or of \underline{p}, is defined by

\begin{aligned} \psi(s)&=\mathbb{E}(s^\xi) \\ &=\sum_{k=0}^\infty p_ks^k\quad\quad \text{for } 0\leq s\leq 1 \end{aligned}

Properties of generating function

  1. \psi(0)=p_0,\quad\psi(1)=\sum_{k=0}^\infty p_k=1

  2. Generating function determines the distribution

    p_k=\frac{1}{k!}\frac{d^k\psi(s)}{ds^k}\Big|_{s=0}

    Reason:

    \psi(s)=p_0+p_1s^1+\cdots+p_{k-1}s^{k-1}+p_ks^k+p_{k+1}s^{k+1}+\cdots

    \frac{d^k\psi(s)}{ds^k}=k!p_k +(\cdots)s +(\cdots)s^2+\cdots

    \frac{d^k\psi(s)}{ds^k}\Big|_{s=0}=k!p_k

    In particular, since all p_k\geq 0, \psi'(s)\geq 0, so \psi(s) is increasing; and \psi''(s)\geq 0, so \psi(s) is convex.

  3. Let \xi_1,...,\xi_n be independent r.v.'s with generating functions \psi_1,...,\psi_n,

    X=\xi_1+...+\xi_n \Rightarrow \psi_X(s)=\psi_1(s)\psi_2(s)...\psi_n(s)

    Proof:

    \begin{aligned} \psi_X(s) &= \mathbb{E}(s^X) \\ &= \mathbb{E}(s^{\xi_1}s^{\xi_2}...s^{\xi_n}) \\ (\text{independence})\quad &= \mathbb{E}(s^{\xi_1})...\mathbb{E}(s^{\xi_n})\\ &= \psi_1(s)...\psi_n(s) \end{aligned}

  4. \frac{d^k\psi(s)}{ds^k}\bigg|_{s=1} = \frac{d^k\mathbb{E}(s^\xi)}{ds^k}\bigg|_{s=1} = \mathbb{E}\Big(\frac{d^ks^\xi}{ds^k}\Big)\bigg|_{s=1} = \mathbb{E}\big(\xi(\xi-1)(\xi-2)...(\xi-k+1)s^{\xi-k}\big)\Big|_{s=1} = \mathbb{E}(\xi(\xi-1)...(\xi-k+1))

    In particular, \mathbb{E}(\xi) = \psi'(1) and Var(\xi)=\mathbb{E}(\xi^2)-(\mathbb{E}(\xi))^2=\mathbb{E}(\xi^2-\xi)+\mathbb{E}(\xi)-(\mathbb{E}(\xi))^2 = \psi''(1)+\psi'(1)-(\psi'(1))^2

    Graph of a g.f.:
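The moment identities above can be sanity-checked numerically. A sketch with a small hypothetical distribution p=(0.2, 0.5, 0.3) on \{0,1,2\}:

```python
# Check psi'(1) = E(xi) and Var(xi) = psi''(1) + psi'(1) - (psi'(1))^2
# for the hypothetical distribution p = (0.2, 0.5, 0.3) on {0, 1, 2}.
p = [0.2, 0.5, 0.3]

def psi(s):  # generating function psi(s) = sum_k p_k s^k
    return sum(pk * s ** k for k, pk in enumerate(p))

h = 1e-4
d1 = (psi(1) - psi(1 - h)) / h                          # psi'(1), numerically
d2 = (psi(1) - 2 * psi(1 - h) + psi(1 - 2 * h)) / h**2  # psi''(1), numerically

mean = sum(k * pk for k, pk in enumerate(p))                 # E(xi) = 1.1
var = sum(k * k * pk for k, pk in enumerate(p)) - mean ** 2  # Var(xi) = 0.49

print(d1, mean)                # both close to 1.1
print(d2 + d1 - d1 ** 2, var)  # both close to 0.49
```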

4.6.1. Branching Process

Each organism, at the end of its life, produces a random number Y of offspring.

\mathbb{P}(Y=k)=P_k, \quad k=0,1,2,..., \quad P_k\geq 0,\quad \sum_{k=0}^\infty P_k=1

The numbers of offspring of different individuals are independent.

Start from one ancestor: X_0=1. X_n: \# of individuals (population in the n-th generation)

Then X_{n+1}=Y_1^{(n)}+Y_2^{(n)}+...+Y_{X_n}^{(n)}, where Y_1^{(n)},...,Y_{X_n}^{(n)} are independent copies of Y, and Y_i^{(n)} is the number of offspring of the i-th individual in the n-th generation.

4.6.1.1. Mean and Variance

Mean: \mathbb{E}(X_n) and Variance: Var(X_n)

Assume \mathbb{E}(Y)=\mu, Var(Y)=\sigma^2.

\begin{aligned} \mathbb{E}(X_{n+1}) &= \mathbb{E}(Y_1^{(n)}+...+Y_{X_n}^{(n)}) \\ &= \mathbb{E}(\mathbb{E}(Y_1^{(n)}+...+Y_{X_n}^{(n)}|X_n)) \\ \text{(Wald's identity, tutorial 3)}\quad &= \mathbb{E}(X_n\mu) \\ &= \mu\mathbb{E}(X_n) \end{aligned}

\begin{aligned} \Rightarrow \mathbb{E}(X_n) &=\mu\mathbb{E}(X_{n-1}) \\ &=\mu^2\mathbb{E}(X_{n-2}) \\ &\quad\quad\vdots \\ &=\mu^n\mathbb{E}(X_0) = \mu^n,\quad n=0,1,... \end{aligned}

Var(X_{n+1}) = \mathbb{E}(Var(X_{n+1}|X_n))+Var(\mathbb{E}(X_{n+1}|X_n))

\begin{aligned} \mathbb{E}(Var(X_{n+1}|X_n)) &=\mathbb{E}(Var(Y_1^{(n)}+...+Y_{X_n}^{(n)}|X_n))\\ &=\mathbb{E}(X_n\cdot\sigma^2) = \sigma^2\mu^n \end{aligned} \quad\quad \begin{aligned} Var(\mathbb{E}(X_{n+1}|X_n)) &= Var(\mu X_n) \\ &= \mu^2Var(X_n) \end{aligned}

\Rightarrow Var(X_{n+1}) = \sigma^2\mu^n+\mu^2Var(X_n)

\begin{aligned} &Var(X_1)=\sigma^2 \\ &Var(X_2)=\sigma^2\mu + \mu^2\sigma^2=\sigma^2(\mu+\mu^2) \\ &Var(X_3)=\sigma^2\mu^2+\mu^2\sigma^2(\mu+\mu^2) = \sigma^2(\mu^2 + \mu^3 + \mu^4)\\ &\quad\quad\vdots \end{aligned}

In general (can be proved by induction),

Var(X_n)=\sigma^2(\mu^{n-1}+...+\mu^{2n-2}) =\begin{cases} \sigma^2\mu^{n-1}\frac{1-\mu^n}{1-\mu} & \mu\neq 1\\ \sigma^2 n & \mu=1 \end{cases}
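These formulas can be checked by simulation. A sketch with a hypothetical offspring distribution \mathbb{P}(Y=0)=\frac{1}{4}, \mathbb{P}(Y=1)=\frac{1}{2}, \mathbb{P}(Y=2)=\frac{1}{4}, so \mu=1 and \mathbb{E}(X_n)=\mu^n=1 for every n:

```python
import random

random.seed(0)

def step(x):
    # one generation: total offspring of x independent individuals,
    # each with offspring distribution (1/4, 1/2, 1/4) on {0, 1, 2}
    return sum(random.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(x))

n, reps = 5, 20000
total = 0
for _ in range(reps):
    x = 1  # start from one ancestor, X_0 = 1
    for _ in range(n):
        x = step(x)
    total += x
avg = total / reps
print(avg)  # close to mu^n = 1
```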

4.6.1.2. Extinction Probability

Q: What is the probability that the population size is eventually reduced to 0?

Note that for a branching process, X_n=0\Rightarrow X_k=0 for all k\geq n. Thus, state 0 is absorbing (P_{00}=1). Let N be the time that extinction happens.

N=\min\{n:X_n=0\}

Define

U_n=\mathbb{P}(\underbrace{N\leq n}_{\text{extinction happens}\atop\text{before or at $n$}})=\mathbb{P}(X_n=0)

Then U_n is increasing in n, and

\begin{aligned} u=\lim_{n\rightarrow\infty}U_n &= \mathbb{P}(N<\infty) \\ &= \mathbb{P}(\text{the extinction eventually happens}) \\ &= \text{extinction probability} \end{aligned}

Our goal: find u

We have the following relation between U_n and U_{n-1}:

U_n=\sum_{k=0}^\infty P_k(U_{n-1})^k = \underbrace{\psi}_{\text{gf of $Y$}}(U_{n-1})

Each subpopulation has the same distribution as the whole population.

Total population dies out in n steps if and only if each subpopulation dies out in n-1 steps.

\begin{aligned} U_n &= \mathbb{P}(N\leq n) \\ &= \sum_k \mathbb{P}(N\leq n|X_1 = k)\underbrace{\mathbb{P}(X_1=k)}_{=P_k} \\ &=\sum_k \mathbb{P}(\underbrace{N_1\leq n-1}_{\text{\# of steps for subpopulation 1 to die out}},\cdots, N_k\leq n-1|X_1=k)\cdot P_k \\ &= \sum_k P_k\cdot U_{n-1}^k \\ &= \mathbb{E}(U_{n-1}^Y) \\ &= \psi(U_{n-1}) \end{aligned}

Thus, the question is :

\quad With initial value U_0=0 (since X_0=1) and relation U_n=\psi(U_{n-1}), what is \lim_{n\rightarrow\infty}U_n=u?

Recall that we have

  1. \psi(0)=P_0\geq 0
  2. \psi(1) = 1
  3. \psi(s) is increasing
  4. \psi(s) is convex

Drawing \psi(s) and the function f(s)=s between 0 and 1, we have two cases:

The extinction probability u will be the smallest intersection of \psi(s) and f(s). Equivalently, it is the smallest solution of the equation \psi(s)=s between 0 and 1.

Reason: See the dynamics on a graph

\Rightarrow \begin{aligned} &\text{Case } 1: &u < 1 \\ &\text{Case } 2: &u=1 & \text{ (extinction happens for sure)} \end{aligned}

Q: How to tell if we are in case 1 or in case 2?

A: check \psi'(1)=\mathbb{E}(Y)

\begin{aligned} & \psi'(1) > 1 &\rightarrow &\text{ Case }1\\ & \psi'(1) \leq 1 &\rightarrow &\text{ Case }2 \end{aligned}

Thus, we conclude:
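The criterion can be illustrated with the fixed-point iteration U_n=\psi(U_{n-1}). A sketch with a hypothetical offspring distribution \mathbb{P}(Y=0)=\frac{1}{4}, \mathbb{P}(Y=2)=\frac{3}{4}, so \psi(s)=\frac{1}{4}+\frac{3}{4}s^2 and \psi'(1)=\frac{3}{2}>1 (Case 1):

```python
# Iterate U_n = psi(U_{n-1}) from U_0 = 0; the iterates increase to the
# smallest root of psi(s) = s in [0, 1], which is the extinction prob u.
def psi(s):
    return 0.25 + 0.75 * s ** 2

u = 0.0
for _ in range(200):
    u = psi(u)
print(u)  # close to 1/3: psi(s) = s has roots 1/3 and 1; the smallest wins
```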

5. Poisson Processes

5.1. Counting Process

DTMC is a discrete-time process. That is, the index set is T=\{0,1,2,...\} and the process is \{X_n\}_{n=0,1,2,\cdots}.

We also want to consider cases where time can be continuous.

\text{Continuous-time processes: } T=[0,\infty)

\{X_t\}_{t\geq 0} \text{ or } \{X(t)\}_{t\geq 0}

The simplest type of continuous-time process is the counting process, which counts the number of occurrences of a certain event before time t.

Definition 5.1.1. Counting Process N(t)N(t)

Let 0\leq S_1\leq S_2\leq\cdots be the times of occurrence of some events. Then, the process

\begin{aligned} N(t) &:= \#\{n:S_n\leq t\} \\ &=\sum_{n=1}^\infty \mathbb{1}_{\{S_n\leq t\}} \end{aligned}

is called the counting process (of the events \{S_n\}_{n=1,2,...})

Equivalently, N(t) = n \iff S_n\leq t< S_{n+1}

Example 5.1.1

Calls arrive at a call center.

Other examples: cars passing a speed reader, atoms having radioactive decay, ...

Properties of a counting process

  1. N(t)\geq 0,\ t\geq 0
  2. N(t) takes integer values
  3. N(t) is increasing.
  4. N(t) is right-continuous

We also assume:

5.2. Definition of Poisson Process

Interarrival Times

Definition 5.2.1. Renewal Process

A renewal process is a counting process for which the interarrival times W_1, W_2, ... are independent and identically distributed.

All three examples of counting processes above can be reasonably modeled as renewal processes.

Definition 5.2.2. Poisson Process

A Poisson process \{N(t)\}_{t\geq0} is a renewal process for which the interarrival times are exponentially distributed:

W_n \stackrel{i.i.d}{\sim} Exp(\lambda)

A Poisson process \{N(t)\}_{t\geq 0} can be denoted as

\{N(t)\}\sim Poi(\underbrace{\lambda}_{\text{intensity}}\, t)
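A simulation sketch of this definition (hypothetical \lambda=2, t=3): build arrivals from i.i.d. Exp(\lambda) interarrival times and check that the mean count is close to \lambda t.

```python
import random

random.seed(1)
lam, t, reps = 2.0, 3.0, 20000

def n_of_t():
    # count arrivals before time t; interarrival times are Exp(lam)
    s, count = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            return count
        count += 1

avg = sum(n_of_t() for _ in range(reps)) / reps
print(avg)  # close to lam * t = 6
```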

Recall: Properties of the Exponential Distribution

X\sim Exp(\lambda)

5.3. Properties of Poisson Processes

5.3.1. Continuous-time Markov Property

\begin{aligned} &\mathbb{P}(N(t_m)=j|N(t_{m-1})=i,N(t_{m-2})=i_{m-2},\cdots, N(t_1)=i_1) \\ =\,&\mathbb{P}(N(t_m)=j|N(t_{m-1})=i) \end{aligned}

for any m, t_1<\cdots<t_m, i_1,i_2,\cdots, i_{m-2},i,j\in S

Fact 5.3.1.1

The Poisson Process is the only renewal process having the Markov Property

Reason:

Since the exponential distribution is memoryless, the future arrival times will not depend on how long we have waited \Rightarrow The future of the counting process only depends on its current value.

In fact,

\begin{aligned} &\mathbb{P}(N(t+s)=j|N(s)=i) \\ (\text{time homogeneity})\quad&= \mathbb{P}(N(t)=j|N(0)=i) \quad\text{(the two only differ by the number at which we start to count)}\\ &= \mathbb{P}(N(t)=j-i|N(0)=0) \end{aligned}

5.3.1.1. Independent Increments

The Poisson Process has independent increments

t_1<t_2<t_3<t_4 \Rightarrow \underbrace{N(t_2) - N(t_1)}_{\text{increment}} \perp\!\!\!\perp \underbrace{N(t_4)-N(t_3)}_{\text{increment}}

Reasons:

Memoryless property of exponential distribution.

5.3.1.2. Poisson Increments

The Poisson Process has Poisson increments

N(t_2)-N(t_1)\sim Poi(\lambda(t_2-t_1))

Reason:

Let the arrival times between t_1 and t_2 be S_1,\cdots, S_N, where N=N(t_2)-N(t_1). Then W_1=S_1-t_1, W_2=S_2-S_1, \cdots are i.i.d. r.v.'s with distribution Exp(\lambda).

\begin{aligned} N=n\Leftrightarrow &W_1+W_2+\cdots +W_n \leq t_2-t_1\\ &W_1+W_2+\cdots +W_n+W_{n+1} > t_2-t_1 \end{aligned}

Fact 5.3.1.2

If W_1,\cdots,W_n are i.i.d. r.v.'s following Exp(\lambda), then W_1+\cdots+W_n\sim Erlang(n,\lambda) (a special type of Gamma).

\text{c.d.f.:}\quad F(x)=1-\sum_{k=0}^{n-1}\frac{1}{k!}e^{-\lambda x}(\lambda x)^k

Thus,

\begin{aligned} &\mathbb{P}(W_1+W_2+\cdots+W_n\leq t_2-t_1)\\ &=1-\sum_{k=0}^{n-1}\frac{1}{k!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^k \end{aligned}

\begin{aligned} &\mathbb{P}(W_1+W_2+\cdots+W_n+W_{n+1}\leq t_2-t_1)\\ &=1-\sum_{k=0}^{n}\frac{1}{k!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^k \end{aligned}

\begin{aligned} \mathbb{P}(N=n) &= \mathbb{P}(W_1+\cdots+W_n\leq t_2-t_1) - \mathbb{P}(W_1+\cdots+W_{n+1}\leq t_2-t_1) \\ &=\frac{1}{n!}e^{-\lambda(t_2-t_1)}(\lambda(t_2-t_1))^n \end{aligned}

In particular, N(t)=N(t)-N(0)\sim Poi(\lambda t)

\mathbb{E}(N(1))=\lambda \quad \leftarrow \text{intensity: expected number of arrivals in one unit of time}

5.3.1.3. Combining and Thinning of Poisson Processes

Theorem:

\{N_1(t)\}\sim Poi(\lambda_1 t)

\{N_2(t)\}\sim Poi(\lambda_2 t)

\{N_1(t)\} and \{N_2(t)\} are independent

Let N(t)=N_1(t)+N_2(t); then \{N(t)\}\sim Poi((\lambda_1+\lambda_2)t)

The combined Poisson Process is still a Poisson Process, with intensity being the sum of intensities.

Reason: Memoryless property, and

W_1\sim Exp(\lambda_1),\quad W_2\sim Exp(\lambda_2),\quad W_1\perp\!\!\!\perp W_2\\ \Rightarrow \min(W_1, W_2)\sim Exp(\lambda_1+\lambda_2)

\Rightarrow the combined process is the counting process of events with interarrival times following Exp(\lambda_1+\lambda_2)

Thinning

Let \{N(t)\}\sim Poi(\lambda t). Each arrival (customer) is labeled as type 1 or type 2, with probability p and 1-p respectively, independently of the others.

Let N_1(t) and N_2(t) be the numbers of customers of type 1 and type 2, respectively, who arrived before time t. Then

\begin{aligned} \{N_1(t)\}&\sim Poi(p\lambda t)\\ \{N_2(t)\}&\sim Poi((1-p)\lambda t)\\ \text{and } \{N_1(t)\}& \perp\!\!\!\perp \{N_2(t)\} \end{aligned}

Reason: This is the inverse procedure of combining two independent Poisson processes into one Poisson process
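A simulation sketch of thinning (hypothetical \lambda=3, t=2, p=0.4): the type counts should average p\lambda t and (1-p)\lambda t.

```python
import random

random.seed(2)
lam, t, p, reps = 3.0, 2.0, 0.4, 20000

tot1 = tot2 = 0
for _ in range(reps):
    s = 0.0
    while True:
        s += random.expovariate(lam)  # next arrival of the Poi(lam t) process
        if s > t:
            break
        if random.random() < p:       # label the arrival type 1 w.p. p
            tot1 += 1
        else:
            tot2 += 1
print(tot1 / reps, tot2 / reps)  # close to p*lam*t = 2.4 and (1-p)*lam*t = 3.6
```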

5.3.1.4. Order Statistics Property

Let X_1,\ldots,X_n be i.i.d. r.v.'s. The order statistics of X_1, \ldots, X_n are random variables defined as follows.

\begin{aligned} &X_{(1)} = \min\{X_1,\ldots,X_n\} \\ &X_{(2)} = \text{2nd smallest among } X_1,\ldots,X_n \\ &\quad\quad\quad\quad\quad\vdots \\ &X_{(n)} = \max\{X_1,\ldots,X_n\} \end{aligned}

In other words, X_{(1)},\ldots,X_{(n)} are such that \{X_{(1)},\ldots,X_{(n)}\}=\{X_1,\ldots,X_n\} and X_{(1)}\leq X_{(2)}\leq\cdots\leq X_{(n)}

Thus, let \{N(t)\}\sim Poi(\lambda t). Conditional on N(t)=n, the points/arrivals of N in [0,t] are distributed as the order statistics of n i.i.d. uniform r.v.'s on [0,t].

That is

(S_1,\ldots, S_n | N(t)=n)\stackrel{d}{=} (U_{(1)},\ldots, U_{(n)})

where U_{(1)},\ldots,U_{(n)} are the order statistics of U_1, \ldots, U_n\stackrel{iid}{\sim}Unif[0,t]

Reason:

\begin{aligned} f_{S_1|\{N(t)=1\}}(s) &=\frac{f_{S_1}(s)\mathbb{P}(W_2>t-s)}{\mathbb{P}(N(t)=1)} \\ &\propto f_{S_1}(s)\mathbb{P}(\underbrace{W_2}_{Exp(\lambda)}>t-s) \\ &= \lambda e^{-\lambda s} e^{-\lambda(t-s)} \\ &= \underbrace{\lambda e^{-\lambda t}}_{\text{const w.r.t. $s$}} \end{aligned}

\Rightarrow S_1|\{N(t)=1\}\sim Unif[0,t]

As a result of the order statistics property, we have the following proposition:

N(s)|\{N(t)=n\}\sim Bin(n,\frac{s}{t})\quad\quad\text{for } s\leq t

Reason: Given N(t)=n, then

\begin{aligned} N(s) &= \#\{S_i : S_i\leq s, i=1,2,\ldots,n\} \\ \text{(since $\{U_{(i)}\}$ is a permutation of $\{U_i\}$)}\quad &= \#\{U_{(i)} : U_{(i)}\leq s, i=1,2,\ldots,n\} \\ &= \#\{U_i : U_i\leq s, i=1,2,\ldots,n\} \end{aligned}

U_i\stackrel{iid}{\sim}Unif[0,t]

\mathbb{P}(U_i\leq s) = \frac{s}{t}\quad\quad i=1,\ldots,n

\Rightarrow N(s)|\{N(t)=n\}\sim Bin(n,\frac{s}{t})
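A simulation sketch of this proposition via the order statistics property (hypothetical n=10, s=1.5, t=4): given N(t)=n, the arrivals behave like n i.i.d. Unif[0,t] points, so counting those \leq s gives a Bin(n, s/t) sample.

```python
import random

random.seed(3)
n, s, t, reps = 10, 1.5, 4.0, 20000

# count how many of n i.i.d. Unif[0, t] points fall in [0, s]
avg = sum(
    sum(1 for _ in range(n) if random.uniform(0, t) <= s)
    for _ in range(reps)
) / reps
print(avg)  # close to the Bin(n, s/t) mean n * s / t = 3.75
```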

6. Continuous-Time Markov Chain

6.1. Definitions and Structures

Definition 6.1.1. Continuous-time Stochastic Process

A continuous-time stochastic process \{X(t)\}_{t\geq 0} is called a continuous-time Markov chain (CTMC) if its state space is at most countable and it satisfies the continuous-time Markov property:

\begin{aligned} &\mathbb{P}(X(t_m)=j|X(t_{m-1})=i, X(t_{m-2})=i_{m-2},\ldots, X(t_1)=i_1) \\ =\,& \mathbb{P}(X(t_m)=j|X(t_{m-1})=i)\\ &\quad\quad \text{for any }m,\ t_1<t_2<\cdots<t_m,\ i_1,\ldots, i_{m-2}, i, j\in S \end{aligned}

As for DTMC, typically S=\{0,\ldots,m\} or \{1,\ldots,m\} or \{0,\pm1,\pm2,\ldots\}

Time is continuous, but the state space is discrete \Rightarrow Process will "jump" between states

It can be regarded as a random step function.

Therefore, we need to specify two things:

  1. When do the jumps happen? \Leftrightarrow How long does the process stay in a state?
  2. When it jumps, where does it jump to?

A CTMC is fully characterized by \{\lambda_i\}_{i\in S} and Q=\{q_{ij}\}_{i,j\in S}

To conclude: a CTMC stays in a state i for an exponential random time T_i, then jumps to another state j with probability q_{ij}, then stays in j for an exponential random time T_j, \cdots; all the jumps and the times spent in different states are independent.
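The mechanism just described can be sketched as a small simulator (the two-state rates and jump probabilities below are hypothetical):

```python
import random

random.seed(4)
lam = {0: 1.0, 1: 2.0}              # holding rates lambda_i
q = {0: [(1, 1.0)], 1: [(0, 1.0)]}  # jump probabilities q_ij as (target, prob)

def simulate(t_end, state=0):
    # hold in `state` for an Exp(lambda_state) time, then jump per q_ij
    t, path = 0.0, [(0.0, state)]
    while True:
        t += random.expovariate(lam[state])
        if t > t_end:
            return path
        targets, probs = zip(*q[state])
        state = random.choices(targets, weights=probs)[0]
        path.append((t, state))

path = simulate(10.0)
print(path[:4])  # (jump time, state) pairs; here states alternate 0, 1, 0, ...
```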

Example 6.1.1.1.

We have seen that the Poisson process satisfies the continuous-time Markov property \Rightarrow it is a CTMC

\lambda_i=\lambda \quad\quad i\in S=\{0,1,2,\cdots\}

(interarrival times do not depend on the current state \Rightarrow times spent in different states are i.i.d.)

q_{i,i+1} = 1\quad\quad q_{ij} = 0 \text{ otherwise } (j\cancel{=}i+1)

6.2. Generator Matrix

Similar to the discrete-time case, we have the transition probability at time tt.

\begin{aligned} P_{ij}(t) &= \mathbb{P}(X(t)=j|X(0)=i) \\ &= \mathbb{P}(X(t+s)=j|X(s)=i) \quad\quad\text{(assume the MC is time-homogeneous)} \end{aligned}

and matrix

P(t)=\{P_{ij}(t)\}_{i,j\in S}

P(t)=\begin{pmatrix} P_{00}(t) & P_{01}(t) & \cdots \\ P_{10}(t) & P_{11}(t) & \cdots \\ \vdots & \vdots & \end{pmatrix}

The C-K equation still holds

P(t+s) = P(t)\cdot P(s)

Proof: \forall i,j\in S

\begin{aligned} P_{ij}(t+s) &= \mathbb{P}(X(t+s)=j|X(0)=i) \\ &= \sum_{k\in S}\mathbb{P}(X(t+s)=j|X(t)=k,\cancel{X(0)=i})\cdot \mathbb{P}(X(t)=k|X(0)=i) \\ &= \sum_{k\in S}\mathbb{P}(X(s)=j|X(0)=k)\cdot\mathbb{P}(X(t)=k|X(0)=i) \\ &= \sum_{k\in S}P_{kj}(s)\cdot P_{ik}(t)\\ &=(P(t)\cdot P(s))_{ij}\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\blacksquare \end{aligned}

Note that we have

P(0)=IP(0) = I

(P_{ii}(0)=\mathbb{P}(X(0)=i|X(0)=i) = 1,\quad P_{ij}(0)=0\text{ for } j\cancel{=}i)

and

\lim_{t\rightarrow0^+} P(t)=I

Actually, we have the following stronger result:

R:= \lim_{h\rightarrow 0^+}\frac{P(h)-P(0)}{h}=\lim_{h\rightarrow0^+}\frac{P(h)-I}{h}

exists, and is called the (infinitesimal) generator matrix of \{X(t)\}_{t\geq 0}

Entry-wise:

R_{ij}=\lim_{h\rightarrow0^+}\frac{P_{ij}(h)-P_{ij}(0)}{h} = \begin{cases} \lim_{h\rightarrow0^+}\frac{P_{ii}(h)-1}{h}\leq 0 & j = i\\ \lim_{h\rightarrow0^+}\frac{P_{ij}(h)}{h}\geq 0 & j \cancel{=}i \end{cases}

Relation between R and \{\lambda_i\}_{i\in S} and Q=\{q_{ij}\}_{i,j\in S}:

R_{ii}=-\lambda_i\quad,\quad R_{ij}=\lambda_iq_{ij}\quad\quad j\cancel{=}i

Reason:

R_{ii}=\lim_{h\rightarrow0^+}\frac{P_{ii}(h)-1}{h} = \lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i>h) - 1}{h}

where T_i is the random time the process stays in i. The equality holds because when h is very small, the probability of having two or more jumps in time h is negligible:

\mathbb{P}(X(h)=i|X(0)=i) = \mathbb{P}(T_i>h) +o(h)\quad \leftarrow\text{having at least 2 jumps and coming back to $i$}

*\ a(h)=o(h) \text{ if } \lim_{h\rightarrow 0}\frac{a(h)}{h}=0

\begin{aligned} R_{ii} &=\lim_{h\rightarrow 0^+}\frac{\mathbb{P}(\overbrace{T_i}^{Exp(\lambda_i)}>h)-1}{h}\\ &=\lim_{h\rightarrow 0^+}\frac{e^{-\lambda_i h}-1}{h}\\ &=\lim_{h\rightarrow 0^+}\frac{e^{-\lambda_i h}-e^{-\lambda_i \cdot 0}}{h}\\ &= \frac{de^{-\lambda_i h}}{dh}\Big|_{h=0}\\ &= -\lambda_i e^{-\lambda_ih}\Big|_{h=0}=-\lambda_i \end{aligned}\\ \begin{aligned} R_{ij}=\lim_{h\rightarrow 0^+}\frac{P_{ij}(h)}{h} &=\lim_{h\rightarrow0^+}\frac{\mathbb{P}(X(h)=j|X(0)=i)}{h}\\ \text{(only one jump happens; two or more}\quad &=\lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i<h,X(T_i)=j)}{h} \\ \text{jumps are negligible)}\quad &=q_{ij}\lim_{h\rightarrow0^+}\frac{\mathbb{P}(T_i<h)}{h} \\ &=q_{ij}\lim_{h\rightarrow0^+}\frac{1-e^{-\lambda_ih}}{h} \\ &=q_{ij}\lim_{h\rightarrow0^+}\frac{e^{-\lambda_i\cdot 0}-e^{-\lambda_ih}}{h} \\ &= q_{ij}\Big(-\frac{de^{-\lambda_ih}}{dh}\Big|_{h=0}\Big) \\ &= q_{ij}\lambda_i \end{aligned}

Thus, we conclude that

R_{ii}=-\lambda_i,\quad R_{ij}=\lambda_iq_{ij}

Note that R_{ii}\leq 0, and R_{ij}\geq 0 for j\cancel{=}i

\begin{aligned} \sum_{j\in S}R_{ij} &= R_{ii}+\sum_{j\in S, j\cancel{=}i}R_{ij} \\ &= -\lambda_i+\sum_{j\in S, j\cancel{=}i}\lambda_iq_{ij} \quad \leftarrow \sum_{j\in S, j\cancel{=}i}q_{ij}=1 \\ &= -\lambda_i+\lambda_i \\ &= 0 \end{aligned}

The row sums of RR are 0

R=\begin{pmatrix} -\lambda_0 & \lambda_0q_{01} & \lambda_0q_{02} & \cdots \\ \lambda_1q_{10} & -\lambda_1 & \lambda_1q_{12} & \cdots \\ \lambda_2q_{20} & \lambda_2q_{21} & -\lambda_2 & \cdots \\ \vdots &\vdots&\vdots&\ddots \end{pmatrix}

-\lambda_i : probability flow / rate going out of state i

\lambda_iq_{ij} : probability flow / rate going from i to j

From R to \{\lambda_i\} and Q: if we know R, then

\lambda_i=-R_{ii}

q_{ij} = \frac{-R_{ij}}{R_{ii}}\quad j\cancel{=}i

Thus, there is a 1-1 relation between \{\lambda_i\}_{i\in S} + \{q_{ij}\}_{i,j\in S} and R=\{R_{ij}\}_{i,j\in S}

\Rightarrow the generator R itself also fully characterizes the transitional behaviour of the CTMC.

Conclusion: \{\lambda_i\}_{i\in S}+\{q_{ij}\}_{i,j\in S} and \{R_{ij}\}_{i,j\in S} are two sets of parameters that can be used to specify a CTMC.
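The conversion from R back to \{\lambda_i\} and Q can be sketched in a few lines (the generator below is hypothetical):

```python
import numpy as np

# Recover lambda_i = -R_ii and q_ij = -R_ij / R_ii from a generator.
R = np.array([[-2.0, 2.0, 0.0],
              [1.0, -3.0, 2.0],
              [0.0, 4.0, -4.0]])

lam = -np.diag(R)               # holding rates lambda_i
Q = R / (-np.diag(R))[:, None]  # divide each row i by -R_ii
np.fill_diagonal(Q, 0.0)        # q_ii = 0: no self-jumps
print(lam)  # [2. 3. 4.]
print(Q)    # each row sums to 1
```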

Example 6.2.1. Poisson Process

\lambda_i=\lambda\\ Q=\begin{pmatrix} 0 & 1 & & \\ & 0 & 1 & \\ & & 0 & 1 \\ & & & \ddots \end{pmatrix}

\begin{aligned} \Rightarrow &R_{ii}=-\lambda_i=-\lambda \quad\quad i=0,1,\ldots\\ &R_{ij}=\lambda_iq_{ij}=\begin{cases} \lambda & j=i+1 \\ 0 & j\cancel{=}i,\ i+1 \end{cases} \end{aligned}\\ R=\begin{pmatrix} -\lambda & \lambda & \\ &-\lambda & \lambda \\ & & \ddots &\ddots \end{pmatrix}

Example 6.2.2. 3-tables in a restaurant

Parties of customers arrive according to a Poisson process with intensity \lambda.

If there are free tables \rightarrow the party is served, and spends an exponential amount of time with average \frac{1}{\mu} (parameter \mu)

If there is no free table \rightarrow the party leaves immediately

Let X(t) be the number of occupied tables at time t \Rightarrow S=\{0,1,2,3\}

Since all the interarrival times and service times are exponential and independent, the process \{X(t)\} is a CTMC.

Find \lambda_i and q_{ij}:

For i=0:

For i=1:

Recall a property of the exponential distribution:

\min(\underbrace{T}_{Exp(\lambda)}, \underbrace{S}_{Exp(\mu)})\sim Exp(\lambda +\mu) \quad\leftarrow \text{the distribution of the time spent in the current state}

\begin{aligned} q_{12}=\mathbb{P}(T<S) &= \frac{\lambda}{\lambda+\mu} \\ q_{10}=1-q_{12}&=\frac{\mu}{\lambda+\mu} \\ q_{13} &= 0 \end{aligned}

Thus,

Similarly, when 2 tables are occupied.

\Rightarrow \min(T, S_1, S_2)\sim Exp(\lambda +2 \mu)

\mathbb{P}(\underbrace{T<\min(S_1,S_2)}_{2\rightarrow 3}) = \frac{\lambda}{\lambda+2\mu}

Thus,

Finally,

Q=\begin{pmatrix} 0 & 1 & 0 & 0 \\ \frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu} & 0 \\ 0 & \frac{2\mu}{\lambda+2\mu} & 0 & \frac{\lambda}{\lambda+2\mu} \\ 0 & 0 & 1 & 0 \end{pmatrix}

R_{ii}=-\lambda_i, \quad R_{ij}=\lambda_i q_{ij}

R=\begin{pmatrix} -\lambda & \lambda & 0 & 0\\ \mu & -(\lambda +\mu) & \lambda & 0 \\ 0 & 2\mu & -(\lambda +2\mu) & \lambda \\ 0 & 0 & 3\mu & -3\mu \end{pmatrix}

We see that R has a simpler form than Q. Actually, R_{ij} for i\cancel{=}j directly corresponds to the "rate" at which the process moves from state i to j. Thus, in practice, we often directly model R rather than first finding \{\lambda_i\}_{i\in S} and Q=\{q_{ij}\}_{i,j\in S}.

6.3. Classification of States

The matrix Q is a transition matrix of a DTMC (q_{ij}\geq 0, \sum_{j\in S}q_{ij}=1). It contains all the information about the state changes, but "forgets" the time. The DTMC defined by Q is called the discrete skeleton (embedded chain) of the CTMC.

Since accessibility, communication, irreducibility, recurrence/transience are only related to the change of states, not the time, these properties will be the same for the CTMC and its discrete skeleton.

For a CTMC $\{X(t)\}_{t\geq 0}$, we say "state $j$ is accessible from state $i$", "$i$ and $j$ communicate", "the process is irreducible", "$i$ is recurrent/transient" if and only if this is the case for its discrete skeleton.

6.3.1. Positive / Null Recurrence

Note that since positive/null recurrence do involve the (expected) amount of time, we can indeed have different results for a CTMC and its discrete skeleton.

Let $R_i$ be the (continuous, random) amount of time until the MC revisits state $i$.

A state $i$ is called positive recurrent, if it is recurrent, and $\mathbb{E}(R_i|X(0)=i)<\infty$. It is called null recurrent, if it is recurrent and $\mathbb{E}(R_i|X(0)=i)=\infty$.

As in the discrete-time case, positive recurrence, null recurrence and transience are class properties.

6.4. Stationary Distribution

Definition 6.4.1. Stationary Distribution

A distribution $\underline{\pi}=(\pi_0,\pi_1,\cdots)$ is called a stationary distribution of a CTMC $\{X(t)\}_{t\geq 0}$ with generator $R$ if it satisfies:

  1. $\underline{\pi}\cdot R=\underline{0}=(0,0,\cdots)$
  2. $\sum_{i\in S}\pi_i=1\quad (\underline{\pi}\cdot 1\!\!\!\!\perp = 1)$
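Conditions (1) and (2) form a linear system, so $\underline{\pi}$ can be computed numerically. A minimal sketch (assuming numpy; the generator is the restaurant example with hypothetical rates $\lambda=1$, $\mu=2$): replace one equation of $\underline{\pi}R=\underline{0}$ with the normalization and solve.

```python
import numpy as np

lam, mu = 1.0, 2.0
R = np.array([
    [-lam, lam, 0.0, 0.0],
    [mu, -(lam + mu), lam, 0.0],
    [0.0, 2 * mu, -(lam + 2 * mu), lam],
    [0.0, 0.0, 3 * mu, -3 * mu],
])

# Solve pi @ R = 0 subject to sum(pi) = 1: work with the transposed
# system and overwrite the last (redundant) equation by normalization.
A = R.T.copy()
A[-1, :] = 1.0                 # last row enforces sum(pi) = 1
b = np.zeros(len(A)); b[-1] = 1.0
pi = np.linalg.solve(A, b)

assert np.all(pi > 0) and np.isclose(pi.sum(), 1.0)
assert np.allclose(pi @ R, 0.0)
```

Overwriting one equation is legitimate because the rows of $R$ sum to zero, so one equation in $\underline{\pi}R=\underline{0}$ is redundant.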

Q: Why is such a $\underline{\pi}$ called stationary?

A: Assume the process starts from the initial distribution $\underline{\alpha}^{(0)}=\underline{\pi}$:

$$\mathbb{P}(X(0)=i)=\pi_i$$

Then the distribution at time tt is given by

$$\underline{\alpha}^{(t)}=\underline{\alpha}^{(0)}\cdot P(t)=\underline{\pi}\cdot P(t)$$

Reason:

$$\begin{aligned} \alpha_j^{(t)} &= \mathbb{P}(X(t)=j)\\ &= \sum_{i\in S}\mathbb{P}(X(t)=j|X(0)=i)\mathbb{P}(X(0)=i)\\ &= \sum_{i\in S}P_{ij}(t)\cdot \alpha_i^{(0)}\\ &=(\underline{\alpha}^{(0)}\cdot P(t))_j \end{aligned}$$

$$\begin{aligned} \Rightarrow \frac{d}{dt}\underline{\alpha}^{(t)} &= \frac{d}{dt}(\underline{\pi}\cdot P(t)) = \underline{\pi}\left(\frac{d}{dt} P(t)\right) \\ &= \underline{\pi}\lim_{h\rightarrow 0^+}\frac{P(t+h)-P(t)}{h} \\ &\overset{\text{C-K equation}}{=} \underline{\pi}\lim_{h\rightarrow 0^+}\frac{P(h)P(t)-P(t)}{h} \\ &= \underline{\pi}\left(\lim_{h\rightarrow 0^+}\frac{P(h) - I}{h}\right)P(t) \\ &= \underline{\pi}\cdot R\cdot P(t) = \underline{0}\cdot P(t) = \underline{0} \end{aligned}$$

$\Rightarrow \underline{\alpha}^{(t)}$ is a constant (vector)

In other words, the distribution of $X(t)$ will not change over time, if the MC starts from the stationary distribution.

Fact 6.4.1. Stationary Distribution

If a CTMC starts from a stationary distribution, then its distribution will never change.

Remark 6.4.1. Kolmogorov's Backward Equation

In the above derivation, we also see that

$$\frac{d}{dt}P(t) = P'(t)=R\cdot P(t)$$

This is called Kolmogorov's Backward Equation.
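The backward equation can be checked numerically, since $P(t)=e^{tR}$. A sketch (assuming scipy is available; $R$ is the restaurant generator with hypothetical rates) compares a finite-difference derivative of $P(t)$ against $R\,P(t)$:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

lam, mu = 1.0, 2.0  # hypothetical rates for the restaurant chain
R = np.array([
    [-lam, lam, 0.0, 0.0],
    [mu, -(lam + mu), lam, 0.0],
    [0.0, 2 * mu, -(lam + 2 * mu), lam],
    [0.0, 0.0, 3 * mu, -3 * mu],
])

t, h = 0.7, 1e-6
P = lambda s: expm(s * R)            # transition matrix P(s) = e^{sR}
dP = (P(t + h) - P(t - h)) / (2 * h) # central-difference derivative

# Kolmogorov's backward equation: P'(t) = R P(t)
assert np.allclose(dP, R @ P(t), atol=1e-5)
```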

6.4.1. Basic Limit Theorem for CTMC

Let $\{X(t)\}_{t\geq 0}$ be an irreducible, recurrent CTMC. Then

$$\lim_{t\rightarrow\infty}P_{ij}(t)=:\pi_j'=\frac{\mathbb{E}(T_j)}{\mathbb{E}(R_j|X(0)=j)} = \frac{1/\lambda_j}{\mathbb{E}(R_j|X(0)=j)}$$

where $T_j\sim Exp(\lambda_j)$ is the holding time in state $j$.

In addition, the MC is positive recurrent if and only if a unique stationary distribution exists. In this case, the stationary distribution is $\underline{\pi}'=(\pi_0',\pi_1',\cdots)$.

Remark 6.4.1.1

$$\frac{\mathbb{E}(T_j)}{\mathbb{E}(R_j|X(0)=j)} = \text{long-run fraction of time spent in state $j$}$$

Thus, $\pi_j'$ is also the long-run fraction of time that the process spends in state $j$.
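This long-run-fraction interpretation can be illustrated by simulation (a sketch assuming numpy; rates are hypothetical): simulate the CTMC by drawing exponential holding times and skeleton jumps, then compare the fraction of time spent in each state with the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 1.0, 2.0
R = np.array([
    [-lam, lam, 0.0, 0.0],
    [mu, -(lam + mu), lam, 0.0],
    [0.0, 2 * mu, -(lam + 2 * mu), lam],
    [0.0, 0.0, 3 * mu, -3 * mu],
])

hold = -np.diag(R)                 # holding rates lambda_i = -R_ii
Q = R / hold[:, None] + np.eye(4)  # skeleton jump probabilities q_ij

time_in = np.zeros(4)
state = 0
for _ in range(100_000):
    time_in[state] += rng.exponential(1.0 / hold[state])  # Exp(lambda_i) stay
    state = rng.choice(4, p=Q[state])                     # skeleton jump

frac = time_in / time_in.sum()

# Stationary distribution via the birth-death product formula, for comparison
r = lam / mu
w = np.array([1.0, r, r**2 / 2, r**3 / 6])
pi = w / w.sum()
assert np.allclose(frac, pi, atol=0.02)
```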

6.5. Birth and Death Processes

Definition 6.5.1. Birth and Death Process

A Birth and Death Process is a CTMC such that $S=\{0,1,\cdots, M\}$ or $S=\{0,1,\cdots\}$, and $\underline{q_{ij}=0}$ if $\underline{|j-i|>1}$.

The process can only change to neighbouring states:

$$q_{i,i-1}+q_{i,i+1} = 1,\quad i\geq 1$$

$$q_{01} = 1$$

For a birth and death process, we use a new set of parameters:

Denote

$$\begin{aligned} &\lambda_i = R_{i,i+1}\quad\quad i=0,1,\cdots\\ &\mu_i = R_{i,i-1}\quad\quad i=1,2,\cdots \end{aligned}$$

$$R=\begin{pmatrix} -\lambda_0 & \lambda_0 \\ \mu_1&-(\lambda_1+\mu_1)&\lambda_1 \\ & \mu_2&-(\lambda_2+\mu_2)&\lambda_2 \\ &&\mu_3&-(\lambda_3+\mu_3)&\lambda_3 \\ &&&\ddots&\ddots \end{pmatrix}$$

$$\Rightarrow \begin{aligned} &R_{ii} =-(\lambda_i+\mu_i) \quad i\geq 1 \\ &R_{00} =-\lambda_0 \end{aligned}$$

$\lambda_i$'s are called the "birth rates" (population $+1$)

$\mu_i$'s are called "death rates" (population $-1$)
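Under these conventions, the generator of any finite-state birth and death process is determined by the two rate sequences. A minimal builder sketch (assuming numpy; the example rates are hypothetical):

```python
import numpy as np

def bd_generator(birth, death):
    """Tridiagonal generator R on states {0..M} from birth rates
    (lambda_0..lambda_{M-1}) and death rates (mu_1..mu_M)."""
    M = len(birth)                       # len(death) must equal M as well
    R = np.zeros((M + 1, M + 1))
    for i in range(M):
        R[i, i + 1] = birth[i]           # lambda_i
        R[i + 1, i] = death[i]           # mu_{i+1}
    np.fill_diagonal(R, -R.sum(axis=1))  # R_ii = -(lambda_i + mu_i)
    return R

# Restaurant example: lambda_i = lam (constant), mu_i = i*mu, states 0..3
lam, mu = 1.0, 2.0
R = bd_generator([lam] * 3, [mu, 2 * mu, 3 * mu])
assert np.allclose(R.sum(axis=1), 0.0)
```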

Example 6.5.1. Previous Example of Restaurant

$$R=\begin{pmatrix} -\lambda &\lambda \\ \mu & -(\mu+\lambda) & \lambda\\ & 2\mu & -(2\mu+\lambda) & \lambda\\ &&3\mu & -3\mu\\ \end{pmatrix}$$

This is a birth and death process.

In general, consider an M/M/s queueing system: customers arrive according to a Poisson process with rate $\lambda$ and are served by $s$ servers with i.i.d. $Exp(\mu)$ service times.

$X(t) =$ # of customers in the system at time $t$

Thus, for the M/M/s queue,

$$\lambda_i=\lambda\quad\quad \mu_i=\begin{cases} i\cdot \mu\quad\quad i\leq s\\ s\cdot \mu\quad\quad i > s \end{cases}$$

Example 6.5.2. A Population Model

Each individual gives birth to an offspring with exponential rate $\lambda$ (i.e. the waiting time until it gets the next offspring $\sim Exp(\lambda)$). Each individual dies with exponential rate $\mu$. Let $X(t)$ be the population size at time $t$.

Given $X(t)=i$, the time until the next (potential) birth is the minimum of the $i$ i.i.d. birth times, $\sim Exp(i\cdot \lambda)$.

Thus the birth rate $\lambda_i=i\cdot \lambda\quad i=0,1,2,\cdots$

Similarly, the death rate $\mu_i=i\mu\quad i=1,2,\cdots$

$$R=\begin{pmatrix} 0 & 0 \\ \mu & -(\lambda+\mu ) & \lambda \\ & 2\mu & -2(\lambda+\mu) & 2\lambda \\ && \ddots&\ddots& \end{pmatrix}$$

$*$: state $0$ is absorbing

If we do need to know $\{\lambda_i\}$ and $Q$:

$$\lambda_i = -R_{ii}=i(\lambda+\mu)\quad i\geq 0$$

$$q_{i,i+1} = \frac{i\lambda}{i(\lambda+\mu)} = \frac{\lambda}{\lambda+\mu} \quad i\geq 1 \quad (*)$$

$$q_{i,i-1}=\frac{i\mu}{i(\lambda+\mu)}=\frac{\mu}{\lambda+\mu} \quad i\geq 1$$

$(*)$ $q_{0,1}=1$, arbitrary since $\lambda_0=0$.

$$Q=\begin{pmatrix} 0 & 1\\ \frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu}\\ &\frac{\mu}{\lambda+\mu} & 0 & \frac{\lambda}{\lambda+\mu}\\ && \ddots& \ddots \end{pmatrix}$$
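The conversion used above, $\lambda_i=-R_{ii}$ and $q_{ij}=R_{ij}/\lambda_i$, is mechanical. A sketch (assuming numpy, on a truncated version of the population chain with hypothetical $\lambda$, $\mu$; the truncation level $n$ is only for illustration):

```python
import numpy as np

lam, mu = 0.5, 1.0
n = 6  # truncate the population chain to states 0..n for illustration
R = np.zeros((n + 1, n + 1))
for i in range(1, n):
    R[i, i + 1] = i * lam   # birth rate
    R[i, i - 1] = i * mu    # death rate
R[n, n - 1] = n * mu
np.fill_diagonal(R, -R.sum(axis=1))

hold = -np.diag(R)          # lambda_i = -R_ii
Q = np.zeros_like(R)
for i in range(n + 1):
    if hold[i] > 0:
        Q[i] = R[i] / hold[i]
        Q[i, i] = 0.0
Q[0, 1] = 1.0               # arbitrary, since state 0 is absorbing

assert np.isclose(Q[1, 2], lam / (lam + mu))
assert np.isclose(Q[1, 0], mu / (lam + mu))
```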

We can further add immigration to the system. Individuals are added to the population according to a Poisson process with intensity α\alpha.

This will not change the "rate" from $i$ to $i-1$.

The time until the next increase in population is now the minimum of two r.v.'s following $Exp(i\lambda)$ (births) and $Exp(\alpha)$ (immigration).

$\Rightarrow$ rate from $i$ to $i+1$ becomes $i\lambda+\alpha$

$$R=\begin{pmatrix} -\alpha & \alpha \\ \mu & -(\lambda+\mu+\alpha) & \lambda + \alpha \\ &2\mu & -(2\lambda+2\mu+\alpha) & 2\lambda + \alpha \\ && \ddots& \ddots \end{pmatrix}$$

$$\begin{aligned} (*) \quad&\lambda_i = R_{i,i+1} = i\lambda +\alpha\quad\quad i=0,1,\cdots\\ &\mu_i = R_{i,i-1} = i\mu \quad\quad\quad\quad i = 1,2,\cdots \end{aligned}$$

$(*)$ not the "biological" birth rate, but the total rate by which the process goes from $i$ to $i+1$

$\{\lambda_i\}$ and $Q$ will change accordingly.

Note that state 0 is no longer absorbing due to the immigration.
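A quick numerical sketch (assuming numpy; $\alpha$, $\lambda$, $\mu$ and the truncation $n$ are hypothetical) of the immigration generator's first rows confirms that state 0 now has a positive holding rate:

```python
import numpy as np

lam, mu, alpha = 0.5, 1.0, 0.3
n = 5  # truncation for illustration
R = np.zeros((n + 1, n + 1))
for i in range(n):
    R[i, i + 1] = i * lam + alpha  # total up-rate: births + immigration
    if i > 0:
        R[i, i - 1] = i * mu
R[n, n - 1] = n * mu
np.fill_diagonal(R, -R.sum(axis=1))

assert np.isclose(R[0, 0], -alpha)  # state 0 is no longer absorbing
assert np.isclose(R[1, 1], -(lam + mu + alpha))
```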

As we see, there are two main types of birth and death processes: queueing systems and population models. The key difference between them is that the birth rate in a queueing system is typically a constant (does not depend on the current state $i$), while the birth rate in a population model increases as $i$ increases.

6.5.1. Stationary Distribution of a Birth and Death Process

$$\begin{cases} \underline{\pi}\cdot R = \underline{0}\quad (1)\\ \underline{\pi}\cdot 1\!\!\!\!\perp = 1 \quad (2) \end{cases}$$

$$(\pi_0,\pi_1,\cdots) \cdot \begin{pmatrix} -\lambda_0 & \lambda_0 \\ \mu_1 & -(\lambda_1+\mu_1)&\lambda_1\\ &\mu_2 & -(\lambda_2+\mu_2)&\lambda_2\\ && \ddots& \ddots& \ddots \end{pmatrix} = \underline{0}$$

$$(1) \Rightarrow \begin{aligned} &-\lambda_0\pi_0+\mu_1\pi_1 = 0 \Rightarrow\pi_1=\frac{\lambda_0}{\mu_1}\pi_0\\ &\lambda_0\pi_0-(\lambda_1+\mu_1)\pi_1+\mu_2\pi_2 = 0 \end{aligned}$$

Adding this to the first equation, we have

$$-\lambda_1\pi_1+\mu_2\pi_2 = 0 \Rightarrow\pi_2=\frac{\lambda_1}{\mu_2}\pi_1$$

In general, adding the first ii equations, we have

$$-\lambda_0\pi_0+\mu_1\pi_1= 0$$

$$\lambda_0\pi_0-(\lambda_1+\mu_1)\pi_1+\mu_2\pi_2=0$$

\vdots

$$\lambda_{i-2}\pi_{i-2}-(\lambda_{i-1}+\mu_{i-1})\pi_{i-1}+\mu_i\pi_i=0$$

$$-\lambda_{i-1}\pi_{i-1}+\mu_i\pi_i=0$$

$$\begin{aligned} \Rightarrow \pi_i &= \frac{\lambda_{i-1}}{\mu_i}\pi_{i-1}\\ &=\cdots\\ &=\frac{\lambda_0\lambda_1\cdots\lambda_{i-1}}{\mu_1\mu_2\cdots\mu_i}\pi_0 \end{aligned}$$

Use $(2)$ to normalize:

$$1=\sum_{n=0}^\infty \pi_n = \left(1+\sum_{n=1}^{\infty}\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}\right)\pi_0$$

$$\begin{aligned} \Rightarrow &\pi_0=\frac{1}{1+\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}}\\ &\pi_i=\frac{\prod_{j=1}^i\frac{\lambda_{j-1}}{\mu_j}}{1+\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}} \end{aligned}$$

Thus, a stationary distribution exists (the MC is positive recurrent, assuming irreducibility) if and only if

$$\sum_{n=1}^\infty\prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j} < \infty$$
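The product formula can be checked numerically against solving $\underline{\pi}R=\underline{0}$ directly. A sketch (assuming numpy; the restaurant chain's hypothetical rates $\lambda=1$, $\mu=2$, for which the state space is finite so the sum is automatically finite):

```python
import numpy as np

lam, mu = 1.0, 2.0
birth = [lam, lam, lam]       # lambda_0, lambda_1, lambda_2
death = [mu, 2 * mu, 3 * mu]  # mu_1, mu_2, mu_3

# pi_i proportional to prod_{j=1}^i lambda_{j-1} / mu_j
w = [1.0]
for b, d in zip(birth, death):
    w.append(w[-1] * b / d)
pi = np.array(w) / sum(w)

# cross-check: pi must solve pi @ R = 0 for the tridiagonal generator
R = np.diag(birth, 1) + np.diag(death, -1)
np.fill_diagonal(R, -R.sum(axis=1))
assert np.allclose(pi @ R, 0.0)
assert np.isclose(pi.sum(), 1.0)
```

The recursion $\pi_i = \frac{\lambda_{i-1}}{\mu_i}\pi_{i-1}$ in the loop is exactly the detailed-balance relation derived above.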

Example 6.5.1.1. M/M/S Queue (cont'd)

$$\lambda_i = \lambda\quad\quad\mu_i=\begin{cases} i\mu\quad i\leq s \\ s\mu\quad i>s \end{cases}$$

$$\begin{aligned} &\sum_{n=1}^\infty \prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j}\\ &= \underbrace{\frac{\lambda}{\mu}+\frac{\lambda}{\mu}\cdot\frac{\lambda}{2\mu}+\cdots+\frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\frac{\lambda}{s\mu} + \frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\left(\frac{\lambda}{s\mu}\right)^2+\frac{\lambda}{\mu}\frac{\lambda}{2\mu}\cdots\left(\frac{\lambda}{s\mu}\right)^3+\cdots}_{\text{eventually a geometric series with ratio $\frac{\lambda}{s\mu}$}} \end{aligned}$$

$\Rightarrow$ The sum is finite if and only if $\lambda<s\mu$

$\Rightarrow$ the process $\{X(t)\}_{t\geq 0}$ is positive recurrent if and only if $\underbrace{\lambda}_{\text{arrival rate}} < \underbrace{s\mu}_{\text{maximal (total) service rate}}$

Example 6.5.1.2. Population Model (with immigration)

$$\lambda_i=i\lambda+\alpha\quad\quad\mu_i=i\mu$$

$$\sum_{n=1}^\infty \prod_{j=1}^n\frac{\lambda_{j-1}}{\mu_j} = \sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}$$

$$\lim_{j\rightarrow\infty}\frac{(j-1)\lambda+\alpha}{j\mu}=\frac{\lambda}{\mu}$$

If $\lambda<\mu$, then $\sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}<\infty$ by the ratio test.

If $\lambda>\mu$, then $\sum_{n=1}^\infty\prod_{j=1}^n\frac{(j-1)\lambda+\alpha}{j\mu}=\infty$.

If $\lambda=\mu$ and $\alpha \geq \lambda =\mu$, the ratio $\frac{(j-1)\lambda+\alpha}{j\mu}\geq 1$ for all $j$

$\Rightarrow$ the terms in the summation are non-decreasing (and bounded below by the positive first term)

$\Rightarrow$ the sum $= \infty$

If $\lambda=\mu$ and $\alpha <\lambda = \mu$, we use:

Raabe-Duhamel's test: (not required content)

$$L:=\lim_{n\rightarrow\infty} n\left(\frac{a_n}{a_{n+1}}-1\right)\begin{cases} >1\quad\text{converge}\\ <1\quad\text{diverge}\\ =1\quad\text{inconclusive} \end{cases}$$

Here:

$$\begin{aligned} L&=\lim_{n\rightarrow\infty}n\left(\frac{n\mu}{(n-1)\lambda+\alpha}-1\right)\\ &=\lim_{n\rightarrow\infty}n\left(\frac{n\lambda-(n-1)\lambda-\alpha}{(n-1)\lambda+\alpha}\right)\quad (\mu=\lambda)\\ &=\lim_{n\rightarrow\infty}n\left(\frac{\lambda-\alpha}{(n-1)\lambda+\alpha}\right)\\ &=\frac{\lambda-\alpha}{\lambda} <1 \end{aligned}$$

$\Rightarrow$ the sum $=\infty$

Conclusion

To sum up, the CTMC is positive recurrent if and only if $\lambda<\mu$.

Q: What happens if $\lambda_0=0$? ($0$ is absorbing)

A:

The chain is not irreducible; typically there are two classes: $\{0\}$ (absorbing) and $\{1,2,\cdots\}$ (transient).

But the chain does not necessarily end up in state $0$, because it can also have $X(t)\rightarrow\infty$. Whether this is a possibility depends on the relation between $\{\lambda_i\}$ and $\{\mu_i\}$.